Friday, December 29, 2023

Understanding the Importance of Data Lineage in Modern Data Management

Data lineage is a component of modern data management that helps organizations understand the origins, transformations, and movement of their data. It is like a road map that shows us where our data has been, how it has changed, and where it is going, just like tracking the journey of a package: from the person who sent it (the source) to the places it passes through, and finally to the person who receives it.

The concept of data lineage has been around for many years, but it has become increasingly important in recent years due to the growth of big data and the increasing complexity of data processing systems.

One of the earliest examples of data lineage thinking can be found in the work of Dr. Donald Knuth, who developed the TeX typesetting system in the 1970s. TeX uses a complex set of rules to generate high-quality printed output, and Knuth relied on detailed tracing of how data flowed through those rules to identify and fix errors in his typesetting logic. He later created Metafont, a companion system for describing fonts, with the same rigor.

In the 1980s and 1990s, data lineage began to be used in the financial services industry to track the flow of financial transactions, driven by growing regulatory pressure that culminated in legislation such as the Gramm-Leach-Bliley Act of 1999.

In the 1990s, data lineage began to be used in the healthcare industry to track the flow of patient data. This was due to the increasing need for healthcare providers to comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

In today’s data-driven world, organizations are collecting and storing more data than ever before. This data is important for making informed business decisions, fueling innovation, and enabling organizations to better understand their customers and markets.

Data lineage provides a detailed record of data’s path, showing how it is created, modified, used, and transformed throughout various processes within an organization. It helps organizations understand the data’s quality, provenance, and business relevance. It can be represented in graphical or tabular formats, allowing stakeholders to easily comprehend complex data relationships and dependencies.

Data lineage has three main components that help us understand how it works. They include:

Thursday, December 28, 2023

Eager Aggregation in SQL queries

Aggregation is a widely used way to summarize the contents of a database. It is usually expressed with a GROUP BY clause or simply by using aggregate functions (such as COUNT or SUM). When the database engine executes a query with aggregations, it produces the individual rows needed to compute the required output and then performs the aggregation as an (almost) final step. In this article, we discuss how to rewrite a query manually so that the order of operations is different, and when doing so can be beneficial.

We start with a reminder that SQL is a declarative language: a properly written query specifies what should be included in the result but does not specify how to calculate it. There are several ways (called execution plans) to do that for almost any query. All execution plans for a query produce the same results but may consume different amounts of computing resources. An optimizer tries to choose the best plan for execution. Usually, state-of-the-art optimizers do their job well, but sometimes they fail to choose a good plan. This may happen for different reasons:

  • The data statistics and/or cost model are imprecise.
  • The optimizer does not consider some classes of plans.

In this article we discuss one type of query transformation that most optimizers do not use. Because of this, it can be worthwhile to rewrite a query by hand so the operations are ordered in a way the optimizer would not choose on its own.
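
As a minimal sketch of the idea (the table and column names here are hypothetical, not from any particular schema), eager aggregation means pushing the GROUP BY below a join so that the join has to process far fewer rows:

-- Original form: join first, aggregate last.
SELECT c.customer_id, c.customer_name, SUM(o.amount) AS total_amount
FROM customer c
  JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;

-- Rewritten form: aggregate the large table first ("eagerly"),
-- then join the much smaller per-customer totals.
SELECT c.customer_id, c.customer_name, o.total_amount
FROM customer c
  JOIN (SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        GROUP BY customer_id) AS o
    ON o.customer_id = c.customer_id;

Both forms return one row per customer with the same total, but the second lets the aggregation shrink the orders table before the join, which can pay off when orders is much larger than customer.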

Tuesday, December 26, 2023

Introducing the MongoDB Document

MongoDB is a document database. As such, the data is stored as individual documents. A document is a data structure made up of one or more field/value pairs. Nearly everything you do in MongoDB is either directly or indirectly related to the documents that you store in a database or move in and out of a database. The better you understand how documents work, the more effectively you can write queries and manage the data.

In my previous article, which was the first in this series, I introduced you to MongoDB and described how to get started with MongoDB Atlas and MongoDB Compass. Atlas provides a cloud-based database service comparable to on-premises MongoDB, and Compass serves as a client interface for connecting to MongoDB and working with document data.

As part of this discussion, I also covered some of the basics of MongoDB documents, demonstrating how to create them and add them to your database. In this article, I expand on this discussion to give you a better sense of how documents are constructed and the different types of data they can contain. To help with this process, I provide several examples that demonstrate some of the ways you can define documents to meet your specific business needs.

Note: For the examples in this article, I used the same MongoDB Atlas and MongoDB Compass environments I set up for the first article. If you want to try out these examples and are uncertain how to connect to Atlas, refer to the first article for more information.

Thursday, December 21, 2023

Counting more efficiently

Nearly a decade ago, I wrote a post called “Bad habits : Counting rows the hard way.” In that post, I talked about how we can use SQL Server’s metadata to instantly retrieve the row count for a table. Typically, people do the following, which has to read the entire table or index:

DECLARE @c int = (SELECT COUNT(*) FROM dbo.TableName);

To largely avoid size-of-data constraints, we can instead use sys.partitions.

Note: We could use OBJECT_ID() instead of a join, but that function doesn’t observe isolation semantics, so can cause blocking – or be a victim. A potential compromise is to create standardized metadata views, but I’ll leave that as an exercise for the reader.
DECLARE @object int = (SELECT o.object_id 
    FROM sys.objects AS o
      INNER JOIN sys.schemas AS s
        ON o.[schema_id] = s.[schema_id]
      WHERE o.name = N'TableName'
        AND s.name = N'dbo');

 DECLARE @c int = (SELECT SUM([rows]) 
    FROM sys.partitions
      WHERE index_id IN (0,1)
        AND object_id = @object);

That’s great when you want to count the whole table without size-of-entire-table reads. It gets more complicated if you need to retrieve the count of rows that meet – or don’t meet – some criteria. Sometimes an index can help, but not always, depending on how complex the criteria might be.
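For example (a sketch using a hypothetical status column and value), a conditional count still has to read every qualifying row, though a narrow nonclustered index on the filter column can keep that read far smaller than scanning the whole table:

-- A narrow supporting index, if the filter is selective enough to justify it:
CREATE INDEX IX_TableName_StatusColumn
    ON dbo.TableName (StatusColumn);

-- Count only the rows that meet the criteria:
DECLARE @c int = (SELECT COUNT(*)
    FROM dbo.TableName
    WHERE StatusColumn = 'Active');  -- hypothetical column and value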

Monday, December 18, 2023

Using a SQL Tokenizer

Imagine this: you have several directories full of SQL script files, and you need to know where a certain table is used. You’d rather like the context too, so you can check the whole SQL Expression and work out why it is running so slowly. Maybe, from that same daunting set of several directories, you need to search for a comment, either end of line, or block comment, perhaps structured. It could be that you just need to execute each query or statement in turn to check performance.

It’s not unusual to want to search just within strings. Although simple searches can be effective, you will at some point need a tool that is able to recognise and return a collection of strings representing the SQL code, divided up into the respective components of the SQL language.

For this article, we’ll use a PowerShell cmdlet, called Tokenize-SQLString, which is in my GitHub repository.

@'
/* Select * from dbo.othertable */
drop view if exists dbo.Book_Purchases_By_Date; 
--drop the Book_Purchases_By_Date view
Select 'create table dbo.YetAnothertable'
'@|Tokenize-SQLString|Out-GridView

This will produce the following stream of objects and find the reference:

Using Spark Jobs for Multiple Lakehouse Maintenance in Microsoft Fabric

I have published videos and articles before about Lakehouse maintenance. In this article I want to address a missing point for a lot of Fabric administrators: How to do maintenance on multiple lakehouses that are located in different workspaces.

One of the videos I have published explains the maintenance of multiple lakehouses, but only addresses maintenance in a single workspace. Is it a good idea to keep multiple lakehouses in the same workspace? Probably not.

The main maintenance tasks you would generally execute on Lakehouses are VACUUM and OPTIMIZE. I already went into detail about them in the videos and articles linked below.
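
For context, here is what those two commands look like when run from Spark SQL against a lakehouse Delta table (a sketch with a hypothetical table name; retention periods and options should follow your own policies):

-- Compact many small files into fewer, larger ones for faster reads
OPTIMIZE sales_orders;

-- Remove data files no longer referenced by the Delta log,
-- keeping the default 7-day (168-hour) retention window
VACUUM sales_orders RETAIN 168 HOURS;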

In the future, as Microsoft Fabric advances in the enterprise, more maintenance tasks may become necessary.

Notebooks can’t be used for this maintenance because of one limitation of the technology at the moment:

Notebooks can’t loop through lakehouses in different workspaces

Thursday, December 14, 2023

MySQL Error Log Management in DevOps Operations

When it comes to development and operations (DevOps), one critical aspect stands out: troubleshooting. The primary goal of a DevOps team is to ensure that the product experiences little to no downtime, because every moment is crucial. Therefore, smooth delivery and uninterrupted uptime are paramount. To achieve this, DevOps teams monitor the product’s performance using logs, and databases, like MySQL, are no exception to this practice.

MySQL provides error logs that serve as a comprehensive record of significant events. These events include activities like the MySQL server starting up or shutting down and any critical issues that arise while it’s running. These logs are indispensable diagnostic tools that help database administrators and experts quickly pinpoint and address problems. By diving deep into these error logs, they can maintain a stable and efficient database, ensuring its reliability and top-notch performance.
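
As a quick illustration (assuming MySQL 8.0.22 or later, where the error log is also exposed through the Performance Schema), you can locate the log file and review recent entries directly from SQL:

-- Where is the error log being written?
SHOW VARIABLES LIKE 'log_error';

-- Review the most recent error-log events
SELECT logged, prio, error_code, subsystem, data
FROM performance_schema.error_log
ORDER BY logged DESC
LIMIT 20;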

In this article, we’ll explore the world of MySQL error logs. We’ll discuss their importance in DevOps, the challenges they bring, and how to manage them effectively. You’ll be well-equipped to turn these challenges into opportunities to improve your DevOps operations.

Wednesday, December 13, 2023

Fabric Notebooks and Deployment Pipelines

In my article about Fabric source control extended features, I explained how Microsoft included notebooks in source control.

This allows us to include notebooks in a Software Development Lifecycle (SDLC) for Power BI objects.

As a result, notebooks need to flow from the development environment to the test and production environments. However, what happens to the references a notebook may contain?

The notebook may contain references to lakehouses and other configurations which may be different in each environment. We need to ensure these references are automatically updated when we promote the notebook to another environment.

Monday, December 11, 2023

Exploring PostgreSQL Indexes

In this blog, we continue the exploration of PostgreSQL indexes which we started here. In that article, we learned what an index is, and how exactly indexes can help with query execution. But there is much more to learn about indexes! In this blog, we will keep exploring B-tree indexes. We will learn whether (and how) database constraints and indexes are related (or not), how exactly an index bitmap scan works, and explore some additional index options available in PostgreSQL.

Indexes and Constraints

In the previous article we learned that the most helpful indexes are indexes with the lowest selectivity, which means that each distinct value in an index corresponds to a small number of rows. The smallest possible number of rows is one; therefore, the most useful indexes are unique indexes.

The definition of a unique index states just that: an index is unique if for each indexed value there is exactly one matching row in the table. PostgreSQL automatically creates a unique index to support any primary key or unique constraint on a table.

Is there any difference between a primary key and a unique constraint? Many SQL developers assume that a primary key must be an incrementing numeric value and that each table “has” to have a primary key. Although it often helps to have a numeric incremental primary key (which in this case is called a surrogate key), a primary key does not have to be numeric, and moreover, it does not have to be a single-attribute constraint. It is possible to define a primary key as a combination of several attributes; it just has to satisfy two conditions: the combination must be UNIQUE and NOT NULL for all participating attributes. In contrast, UNIQUE constraints in PostgreSQL allow for NULL values.

A table can only have a single primary key (though a primary key is not required), but can also have multiple unique constraints. Any non-null unique set of columns can be chosen to be a primary key for a table; thus, there is no programmatic way to determine the best candidate for a table’s primary key.
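
For instance (a hypothetical table, not part of the sample database used below), a primary key can be declared over a combination of columns:

CREATE TABLE flight_crew (
    flight_id   integer NOT NULL,
    employee_id integer NOT NULL,
    crew_role   text    NOT NULL,
    CONSTRAINT flight_crew_pkey PRIMARY KEY (flight_id, employee_id)
);

The combination of flight_id and employee_id must be unique and non-null, even though neither column is unique on its own.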

Note: If you want to try the code, the appendix has the information you need to install the sample database we will be using, as well as some necessary updates before starting with the current article.

For example, the table booking has a primary key on booking_id and a unique key on booking_ref. Note: you do not need to create a unique constraint to create a unique index:

ALTER TABLE booking
  ADD CONSTRAINT booking_pkey PRIMARY KEY (booking_id);

ALTER TABLE booking
  ADD CONSTRAINT booking_booking_ref_key UNIQUE (booking_ref);

In this case, both columns are unique, and moreover, there is no need for a booking_id column at all. A booking reference is generated by the reservation system, and it is always both UNIQUE and NOT NULL. The only reason we added a booking_id field to the booking table was to allow application developers to comply with their development standards.

In the case below, the situation is different. The account table has a primary key, account_id. When we add a unique index on the frequent_flyer_id column, we want to make sure that there will never be two accounts with the same frequent_flyer_id; however, we will have many accounts for which frequent_flyer_id is null.

CREATE UNIQUE INDEX account_freq_flyer 
  ON account (frequent_flyer_id);

What about foreign keys? Do they automatically create any indexes? A common misconception is the belief that the presence of a foreign key necessarily implies the presence of an index on the child table. This is not true.

A foreign key is a referential integrity constraint; it guarantees that for each non-null value in the child table (i.e., the table with the foreign key constraint), there is a matching unique value in the parent table (i.e., the table it is referencing). The column(s) in the parent table must have a primary key or unique constraint on them, however, nothing is automatically created on a child table.

This misconception often backfires when we need to delete a row from a parent table. I observed a user wondering why a deletion of one row (supported by a unique index) took over 10 minutes. It turned out that the table was a parent table for thirteen foreign key constraints, and none of them was supported by an index. To make sure that deletion wouldn’t violate any of these integrity constraints, Postgres had to sequentially scan thirteen large tables.
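
In a situation like that, the remedy is simply to index the foreign key column on each child table that participates in the constraint (the names below are hypothetical):

-- Lets Postgres locate the referencing rows with an index scan
-- instead of a sequential scan when a parent row is deleted or updated
CREATE INDEX order_line_order_id ON order_line (order_id);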

Does this mean that we should always create an index on column(s) that have a foreign key constraint? Not necessarily. Remember that only indexes with reasonably low selectivity are useful. To illustrate what “reasonable” means in this context, let’s consider the following example.

The flight table in the postgres_air database has a foreign key constraint on the aircraft_code:

ALTER TABLE flight
    ADD CONSTRAINT aircraft_code_fk FOREIGN KEY (aircraft_code)
    REFERENCES aircraft (code);

This foreign key constraint is necessary because for each flight, there must be a valid aircraft assigned. To support the foreign key constraint, a primary key constraint was added to the aircraft table. That table, however, has only 12 rows. Therefore, it is not necessary to create an index on the aircraft_code column of the flight table. This column has only 12 distinct values, so an index on that column will not be used.

Let’s examine the execution plan for the following query:

SELECT 
    f.flight_no,
    f.scheduled_departure,
    model,
    count(passenger_id) passengers
FROM flight f
JOIN booking_leg bl ON bl.flight_id = f.flight_id
JOIN passenger p ON p.booking_id=bl.booking_id
JOIN aircraft ac ON ac.code=f.aircraft_code
WHERE f.departure_airport ='JFK'
      AND f.scheduled_departure BETWEEN
          '2023-08-14' AND '2023-08-16'
GROUP BY 1,2,3

The execution plan is presented in Figure 1; don’t be put off by its size – at this moment, we need just a small portion of it. Pay attention to lines 19-20:

Hash  (cost=1.12..1.12 rows=12 width=64)
       -> Seq Scan on aircraft ac (cost=0.00..1.12 rows=12

The PostgreSQL optimizer accesses table statistics and is able to detect that the size of the aircraft table is small and index access won’t be efficient. In contrast, the index on the departure_airport field of the flight table proved to be useful due to its low selectivity, and you can see it being used in lines 17-18:

->  Bitmap Index Scan on flight_departure_airport 
 (cost=0.00..119.33 rows=10521 width=0)
Index Cond: (departure_airport = 'JFK'::bpchar)


Figure 1. Execution plan with sequential scan on a small table

What does “bitmap” mean?

The term ‘bitmap’ has already appeared in execution plans several times. Each time I promised “to talk about it later,” and now this “later” is here.

It is critically important to fully understand how index bitmapping works in PostgreSQL, especially because other major DBMSs do not have similar access methods, the documentation does not go into much detail about it, and there is a lot of confusion among users regarding how it works. Let’s proceed with uncovering this mystery.

Depending on the index selectivity, PostgreSQL utilizes one of two methods. The first is an index scan; the second is a bitmap index scan, followed by a bitmap heap scan.

In the previous blog, we discussed how the index scan works. An index entry includes the addresses of the records which match the indexed value; this address contains the block address and the offset of the record within the block. In an index scan, the database engine reads each entry of the index that satisfies the filter condition and retrieves blocks in index order. Because the underlying table is a heap, multiple index entries might point to the same block. If an index has a low selectivity, these situations are rare, and there is a high chance that no block will be read (or rather, attempted to be read) more than once. However, the situation is different when the selectivity is high, multiple indexes are used, or the filtering condition is something other than equality. In these cases, a bitmap index scan will be used. This method builds a heap bitmap in memory. The whole bitmap is a single bit array, with as many bits as there are heap blocks in the table being scanned.

Each index involved in satisfying the selection criteria is processed in turn. To perform a bitmap index scan, first, a bitmap is created with all entries initially set to 0 (false). Whenever an index entry that matches the search condition is found, the bit corresponding to the heap block indicated by the index entry is set to 1 (true). The second and any additional bitmap index scans do the same thing with the indexes corresponding to the additional search conditions. Once all bitmaps have been created, the engine performs a bitwise logical AND operation to find which blocks contain requested values for all selection criteria, producing a final candidate list. This means that blocks that satisfy only one of the two criteria in a logical AND never have to be accessed. An illustration is shown in Figure 2.

Figure 2. Using bitmaps for table access through multiple indexes

Note that AND is not the only logical operator which can be used on bitmaps. If some conditions in the selection criteria are connected with OR, the logical OR will be applied, and if a query has more search conditions on a single table, more bitmaps can be put to use.

After the final candidate list is computed, the candidate blocks are read sequentially using a bitmap heap scan (a heap scan based on a bitmap), and for each block, the individual records are examined to re-check the search conditions. Note that requested values may reside in different rows in the same block. The bitmap ensures that relevant rows will not be missed but does not guarantee that all scanned blocks contain a relevant row.

In Figure 1, lines 15 and 17 represent two bitmap index scans, line 14 represents the bitmap AND, and line 12 shows the bitmap heap scan. To summarize, the bitmap access method always includes these three steps:

  • Bitmap index scan(s)
  • Logical AND/OR
  • Bitmap Heap scan

Do we need compound indexes?

Like other RDBMSs, PostgreSQL allows compound (or multi-column) indexes. Since PostgreSQL can utilize multiple indexes for the same query using bitmap index scans, is there any reason to create compound indexes? Let’s run some experiments.

SELECT 
      scheduled_departure ,
      scheduled_arrival
FROM flight
WHERE departure_airport='ORD' 
  AND arrival_airport='JFK'
  AND scheduled_departure BETWEEN '2023-07-03' AND '2023-07-04';

The execution plan for this SELECT statement is presented in Figure 3.


Figure 3. Execution plan with three indexes

There are three indexes which support all three search criteria, and Postgres takes advantage of all three of them. Let’s create a compound index on all three columns:

CREATE INDEX flight_depart_arr_sched_dep ON  flight(
        departure_airport,
        arrival_airport,
        scheduled_departure);

The new execution plan is presented in Figure 4:


Figure 4. Execution plan with compound index

This new compound index will support searches by departure_airport, by departure_airport and arrival_airport, and by departure_airport, arrival_airport, and scheduled_departure. It will not support, however, searches by arrival_airport or scheduled_departure alone. Indeed, recall how b-tree indexes are built and how they are used: we rely on the ordering of indexed values. The values of compound indexes are ordered first by the first column, then by the second column, and so on.

The query

SELECT 
      departure_airport,
       scheduled_arrival,
       scheduled_departure
FROM flight
WHERE  arrival_airport='JFK'
       AND scheduled_departure 
          BETWEEN '2023-07-03' AND '2023-07-04'

…will produce the execution plan shown in Figure 5.


Figure 5. Compound index is not used

On the other hand, the query

SELECT 
      scheduled_departure ,
       scheduled_arrival
FROM flight
WHERE departure_airport='ORD' AND arrival_airport='JFK'
  AND scheduled_arrival BETWEEN '2023-07-03' AND '2023-07-04';

…will use the compound index, although only for the first two columns, as shown in Figure 6.


Figure 6. A plan that uses the compound index for the first two columns

In general, an index on (X,Y,Z) will be used for searches on X, (X,Y), and (X,Y,Z) and even (X,Z) but not on Y alone and not on (Y,Z). Thus, when a compound index is created, it’s not enough to decide which columns to include; their order must also be considered.
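
A quick way to check which index the optimizer picks for a given search (a sketch assuming the postgres_air schema and the flight_depart_arr_sched_dep index created above; the actual choice also depends on your statistics) is to look at the plan with EXPLAIN:

-- Filters on the leading column, so the compound index is a candidate
EXPLAIN
SELECT flight_no FROM flight
WHERE departure_airport = 'ORD';

-- Skips the leading column, so the compound index cannot be used;
-- the separate flight_arrival_airport index from the appendix applies instead
EXPLAIN
SELECT flight_no FROM flight
WHERE arrival_airport = 'JFK';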

Why create compound indexes? After all, the previous section demonstrated that using several indexes together will work just fine. Most of the time, the decision on whether to create a compound index is based on whether it can provide additional selectivity and improve the performance of critical queries. I will cover other potential advantages of compound indexes in the next blog.

Summary

PostgreSQL has multiple ways to index tables, and knowing what options are available can dramatically improve an application’s performance. Remember that so far we have only talked about b-tree indexes, we are not done with them yet, and we haven’t even started talking about other index types!

Did you have fun exploring indexes in PostgreSQL? Would you like to learn more about indexing? Look for the next installment in this series.

Appendix – Setting up the training database

If you want to repeat the experiments in this article, the instructions for installing the base database can be found in the Appendix of the first article in this series. If you are starting here, use those instructions to download the database. Once you have downloaded and restored the database, you will need to create the following additional indexes on your copy of the postgres_air database:

SET search_path TO postgres_air;
CREATE INDEX flight_arrival_airport ON flight  (arrival_airport);
CREATE INDEX booking_leg_flight_id ON booking_leg  (flight_id);
CREATE INDEX flight_actual_departure ON flight  (actual_departure);
CREATE INDEX boarding_pass_booking_leg_id ON boarding_pass  (booking_leg_id);
CREATE INDEX booking_update_ts ON booking  (update_ts);

Don’t forget to run ANALYZE on all the tables for which you build new indexes:

ANALYZE  flight ;
ANALYZE  booking_leg;
ANALYZE  booking;
ANALYZE boarding_pass;



Friday, December 8, 2023

What Counts For a DBA: Duty

It has been a long month. We had the PASS Data Community Summit just a month ago, and I came home with a hitchhiking virus (no, not that one). I dragged through Thanksgiving, got better, and went on vacation. I got back home from vacation, and, well, I was sick again.

Looking at my calendar this morning, I realized it was time for the Simple-Talk editorial to go in the newsletter. As I searched my lists of possible topics I had accumulated, none suited how I felt. Then it hit me, “duty”. This editorial needs to be written, and it is my duty to make that happen.

The topic reminded me of the years I was on call. Anytime you are in an admin/support role, there will be times when you just have to work through it. Many nights, the phone would start ringing (I am a heavy sleeper, so I made an extra loud ringtone from the Agents of Shield Theme song, starting at 31 seconds in). I typically felt a lot like I do now sitting there at 3 AM, feeling sort of sick, tired, confused, staring at the screen, trying to figure out how to get started on the task at hand. Then the questions arise. Why am I sitting in front of this screen with my ears stuffy, half awake, trying to figure out some bug that I usually didn’t cause (it was usually, but not always, the fault of front-end software that didn’t use enough constraints.) Why is this my job? What am I doing with my life? (Stuff comes up when you are shocked awake, especially in the middle of the night.)

Of course, even if it was painful, it did make some sense, which I reminded myself of regularly. I was on the reporting team, and our daily analysis of activity was the only critical process that needed to be completed by the morning so upper management could make decisions on how to direct the next day’s business. We took pride in making our software get the most correct answer, and the price for this was that two thirds of the on-call activity was dealing with non-conforming data that we would just have to deal with manually. All the team members took turns, so it wasn’t that much of a burden, but that was never how it felt when the sun was still hiding on the other side of the planet and I was sitting behind my keyboard trying to keep my eyes open.

In the case of my job here as editor, well, I could almost certainly ask for help, but I prefer to do my duty and make sure I do the essential work if I am able.


Thursday, December 7, 2023

AWS Step Functions in C# – Part 2

In part one of this series, I built a state machine which runs in the background to process uploaded resumes via step functions in C#. The overall process took seconds to complete, which made it unsuitable for actual users who demand immediate (or at least subsecond) results.

To recap, the state machine executed two lambda functions: one to create a pre-signed URL to download the resume from S3, and one that uses Textract to find the GitHub profile in the resume. The result went to SQS for quick asynchronous consumption.

In this take, I will build a client facing API that can return resume data in real time using AWS lambda functions. The API will have three endpoints, POST, GET, and DELETE to handle these processed resumes.

The POST endpoint will upload the resume to S3 and kick off the asynchronous process via the state machine. This happens without logically blocking the API. The GET will query SQS for pending messages in the queue and quickly return processed information. Lastly, the DELETE endpoint will purge the queue once a recruiter is done reviewing resumes.

If you get lost with the code samples, be sure to check out the code, which is up on GitHub. This repo has the step functions from the previous part and this new API.

Continuing with Step Functions

Be sure to have the latest AWS CLI tools for .NET. If you are following from the previous post then you should already have the .NET 6 SDK, and the AWS CLI tool. The CLI tools are constantly evolving so I recommend upgrading to the latest available.

> dotnet tool update -g Amazon.Lambda.Tools
> dotnet new install Amazon.Lambda.Templates

This makes a list of templates available to you so you can spin up AWS projects. Because this is for an API, use the serverless.EmptyServerless template and create a new project inside the main project folder.

> dotnet new serverless.EmptyServerless --region us-east-1 
--name Aws.Api.ResumeUploader

This creates the scaffolding to build an API using lambda annotations. Lambda annotations bridge the gap between the lambda programming model and a more idiomatic .NET model already familiar to .NET developers.

This approach also takes care of creating a separate lambda for each endpoint. This means that the AWS Gateway routes the HTTP request and fires the lambda without ASP.NET or any reverse proxy in .NET doing the heavy lifting. This technique helps with cold starts and minimizes the work the lambda must do, which can help keep latencies low. The API itself is meant for real time consumption so keeping responses as quick as possible is ideal.

Because the scaffold is a bit silly and puts files and folders all over the place, I recommend restructuring your app in the following way: simply flatten the folders and put them all under the main project, like so.


Figure 1. Folder structure

These new projects can coexist with the previously created step functions. Be sure to add these new projects to the main solution file.

> dotnet sln add Aws.Api.ResumeUploader\Aws.Api.ResumeUploader.csproj
> dotnet sln 
  add Aws.Api.ResumeUploader.Tests\Aws.Api.ResumeUploader.Tests.csproj

Now, do a dotnet build, and double check that project references between the unit test and the API project are correct. The files created by the scaffolding are all the files you are going to need for this API.

Because this is using lambda annotations, the serverless.template file is automatically updated and generated for you.

Build the API

Time for some coding fun. These are the NuGet packages you are going to need to flesh out the entire solution.

  • Amazon.Lambda.Annotations: idiomatic approach to creating AWS lambda APIs in .NET
  • AWSSDK.S3: S3 bucket client to upload resumes from the API itself
  • AWSSDK.StepFunctions: step functions client to kick off the asynchronous process without blocking the API
  • AWSSDK.SQS: simple queue service for reviewing processed resumes and purging old ones
  • Moq: for the unit test project only, this is an object mocking tool in spite of a recent snafu with mishandling developer emails

All these dependencies, except for Moq, go in the API project. The lambda annotations dependency comes preinstalled via the scaffold. Be sure to upgrade all your NuGet packages to the latest version; this eliminates snags you might run into with the tooling.

API Skeleton

There should be a Functions.cs file with a Functions class. Gut the entire class and put this skeleton in place.

public class Functions
{
  private readonly IAmazonS3 _s3Client;
  private readonly IAmazonStepFunctions _sfnClient;
  private readonly IAmazonSQS _sqsClient;
  private const string S3BucketName = "<s3-bucket-name>";
  private const string StateMachineArn = "<state-machine-arn>";
  private const string SqsUrl = "<sqs-url>";
  public Functions(
    IAmazonS3 s3Client,
    IAmazonStepFunctions sfnClient,
    IAmazonSQS sqsClient)
  {
    _s3Client = s3Client;
    _sfnClient = sfnClient;
    _sqsClient = sqsClient;
  }
  [LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
MemorySize = 1024, Timeout = 5)]
  [RestApi(LambdaHttpMethod.Get, "/")] // GET endpoint
  public Task<IHttpResult> Get(ILambdaContext context)
  {
    throw new NotImplementedException();
  }
  [LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
    MemorySize = 1024, Timeout = 5)]
  [RestApi(LambdaHttpMethod.Post, "/")] // POST endpoint
  public Task<IHttpResult> Post(
    [FromBody] string fileContent, // raw binary string
    [FromQuery] string fileName,
    ILambdaContext context)
  {
    throw new NotImplementedException();
  }
  [LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
MemorySize = 1024, Timeout = 5)]
  [RestApi(LambdaHttpMethod.Delete, "/")] // DELETE endpoint
  public Task<IHttpResult> Delete(ILambdaContext context)
  {
    throw new NotImplementedException();
  }
}

The lambda annotations set the basic settings for each lambda function, like the memory size and timeout. This also injects dependencies using dependency injection instead of a default constructor. Notice the POST endpoint takes the file content as a raw string instead of binary data like a byte array. Be sure to specify your own S3 bucket name, state machine ARN, and SQS URL; you will find these settings at the top of the class.

Unit Tests Skeleton

Open the FunctionsTests.cs file in the test project. Gut the entire class and put this in place.

public class FunctionsTest
{
  private readonly Mock<IAmazonS3> _s3Client;
  private readonly Mock<IAmazonStepFunctions> _sfnClient;
  private readonly Mock<IAmazonSQS> _sqsClient;
  private readonly Functions _functions;
  public FunctionsTest()
  {
    _s3Client = new Mock<IAmazonS3>();
    _sfnClient = new Mock<IAmazonStepFunctions>();
    _sqsClient = new Mock<IAmazonSQS>();
    _functions = new Functions( // inject mock dependencies
      _s3Client.Object,
      _sfnClient.Object,
      _sqsClient.Object);
  }
}

You should have a working build now. Build the entire solution and open the serverless.template file. In each generated lambda function configuration, add the following properties.

"Properties": {
  "Architectures": [
    "arm64"
  ],
  "FunctionName": "resume-uploader-api-<verb-name>"
}

Make sure the FunctionName matches the endpoint verb, for example, resume-uploader-api-post for the POST endpoint, and so on. This makes it much easier to find your lambda functions in the AWS console.

The ARM architecture uses the Graviton2 chip by AWS and it is recommended for most use cases. Luckily, .NET supports both Linux and ARM so there is nothing to worry about.

TDD

Next, to follow TDD (Test Driven Development), write the unit tests first, then the implementation. This forces you to think of a succinct solution before diving into more complex use cases. Think of it like painting: you start with abstract strokes in the background, much like the unit tests, then work your way into the complex details in the foreground, much like the gory details needed to pass the tests.

The POST, GET, and DELETE endpoints must all use the AWS clients provided so simply verify these clients are being used, and assert the status code of the response. At this point, the emphasis is not necessarily the gnarly implementation details.

[Fact]
public async Task PostReturns201()
{
  // arrange
  var context = new TestLambdaContext();
  const string fileContent = 
"TWFueSBoYW5kcyBtYWtlIGxpZ2h0IHdvcmsu"; // assume base64
  const string fileName = "test-file-name";
  // act
  var response = await _functions.Post(fileContent, fileName,
 context);
  // assert
  Assert.Equal(HttpStatusCode.Created, response.StatusCode);
  _s3Client.Verify(m => m.PutObjectAsync(
    It.IsAny<PutObjectRequest>(),
    CancellationToken.None));
  _sfnClient.Verify(m => m.StartExecutionAsync(
    It.IsAny<StartExecutionRequest>(),
    CancellationToken.None));
}
[Fact]
public async Task GetReturns200()
{
  // arrange
  var context = new TestLambdaContext();
  _sqsClient
    .Setup(m => m.ReceiveMessageAsync(
      It.IsAny<ReceiveMessageRequest>(),
      CancellationToken.None))
    .ReturnsAsync(new ReceiveMessageResponse
    {
      Messages = new List<Message>
      {
        new() {Body = "stuff"} // the body is all I care about
      }
    });
  // act
  var response = await _functions.Get(context);
  // assert
  Assert.Equal(HttpStatusCode.OK, response.StatusCode);
}
[Fact]
public async Task DeleteReturns202()
{
  // arrange
  var context = new TestLambdaContext();
  // act
  var response = await _functions.Delete(context);
  // assert
  Assert.Equal(HttpStatusCode.Accepted, response.StatusCode);
  _sqsClient
    .Verify(m => m.PurgeQueueAsync(
      It.IsAny<PurgeQueueRequest>(),
      CancellationToken.None));
}

Note that in the POST endpoint I am assuming the body is a C# string in base64 format. Ideally it would be a byte array or a stream, but since this is coming from the web, and specifically the AWS Gateway, it must be a string.

The GET endpoint simply takes the body property of the SQS messages and sends that back to the API client.

To pass these tests flesh out the implementation with these basic requirements from the unit tests.

[LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
MemorySize = 1024, Timeout = 5)]
[RestApi(LambdaHttpMethod.Get, "/")]
public async Task<IHttpResult> Get(ILambdaContext context)
{
  var result = await _sqsClient.ReceiveMessageAsync
  (new ReceiveMessageRequest
  {
    QueueUrl = SqsUrl
  });
  return HttpResults.Ok(result.Messages.Select(m => m.Body));
}
[LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
MemorySize = 1024, Timeout = 5)]
[RestApi(LambdaHttpMethod.Post, "/")]
public async Task<IHttpResult> Post(
  [FromBody] string fileContent,
  [FromQuery] string fileName,
  ILambdaContext context)
{
  var byteArray = Convert.FromBase64String(fileContent);
  using var inputStream = new MemoryStream(byteArray);
  await _s3Client.PutObjectAsync(new PutObjectRequest
  {
    BucketName = S3BucketName,
    Key = fileName,
    InputStream = inputStream // upload an actual binary stream
  });
  await _sfnClient.StartExecutionAsync(new StartExecutionRequest
  {
    Input = JsonSerializer.Serialize
    (new {storedFileName = fileName}),
    StateMachineArn = StateMachineArn
     });
  return HttpResults.Created();
}
[LambdaFunction(Policies = "AWSLambdaBasicExecutionRole", 
MemorySize = 1024, Timeout = 5)]
[RestApi(LambdaHttpMethod.Delete, "/")]
public async Task<IHttpResult> Delete(ILambdaContext context)
{
  await _sqsClient.PurgeQueueAsync(new PurgeQueueRequest
  {
    QueueUrl = SqsUrl
  });
  return HttpResults.Accepted();
}

The one perilous implementation detail here is having to take the raw string from the request body and turn it into a C# stream so the PDF file can be properly uploaded to S3.

Lastly, hook up all dependencies using the IoC container provided for you in the Startup file.

public void ConfigureServices(IServiceCollection services)
{
  services.AddAWSService<IAmazonS3>();
  services.AddAWSService<IAmazonStepFunctions>();
  services.AddAWSService<IAmazonSQS>();
}

If you get compilation errors, there is a NuGet package you will need to make this magic possible. Simply add AWSSDK.Extensions.NETCore.Setup to the project.

Deploy the API

With the code taking shape, you should have a good build and passing unit tests. Change directory into the Aws.Api.ResumeUploader project and run the following command.

> dotnet lambda deploy-serverless

The prompt should ask you for a name of the CloudFormation stack. Simply pick a name, I went with resume-uploader-api. The tool will also prompt for an S3 bucket to upload the zip files into before it deploys the lambda functions. Pick something like resume-uploader-api-upload. Keep in mind S3 bucket names must be unique, so if you are not able to use this name just pick a different one.

After the deploy tool successfully deploys, go into the AWS console and find IAM. Each endpoint has an execution role attached to it, and each role will need permissions to access AWS resources like SQS, S3, and Step Functions. If you have a hard time finding the execution roles, you can go via the lambda function; the execution role is listed under Permissions in the Configuration tab.

Make sure the following permissions are specified for each endpoint.


Figure 2. POST endpoint permissions


Figure 3. GET endpoint permissions


Figure 4. DELETE endpoint permissions

This is what the list of API lambda functions looks like.


Figure 5. API lambda functions

Lastly, before we can upload PDF files to our API, the AWS Gateway must be configured to convert the binary format into a base64 string.

Go to the AWS console and click on the API Gateway. Find your API, it should match the name you specified in CloudFormation. Then click Settings and add application/pdf to the list of Binary Media Types. This is what allows you to assume that the file content is a string in base64 format in your lambda function.

Call the API

The API is now able to handle binary files directly and has permissions to execute step functions. This gets you over the hurdle I ran into earlier because it was not possible to simply kick off step functions from an S3 bucket event.

To kick off the state machine and upload a resume via the API:

> curl -i -X POST -H "Content-Type: application/pdf" 
  --data-binary "@ExampleResume.pdf" 
  https://<GATEWAY-API-ID>.execute-api.<REGION>.amazonaws.com/Prod/?fileName=ExampleResume.pdf

Be sure to include the trailing slash (/) at the end of the URL. The AWS Gateway is sometimes sensitive to this and returns a 403 Forbidden when it is missing. This CURL command assumes that the ExampleResume PDF file is in the current folder.

You should see a 201 Created response when the lambda function completes execution. Keep in mind that the latency should remain fast because all this does is upload a file and kick off a background process.

Give it a few seconds, then follow up the previous command with a call to the GET endpoint. If everything went well, you should have items in the SQS queue.

> curl -i -s -X GET -H "Content-Type: application/json" 
https://<GATEWAY-API-ID>.execute-
api.<REGION>.amazonaws.com/Prod/

If there is still nothing in the queue, check in CloudWatch or Step Functions. Luckily, AWS has some really nice monitoring tools to figure out what went sideways. Also, be sure to check the S3 bucket and make sure that the file uploaded is a proper PDF file.

Finally, once you are done, the DELETE endpoint purges the queue entirely.

> curl -i -s -X DELETE -H "Content-Type: application/json" https://<GATEWAY-API-ID>.execute-
api.<REGION>.amazonaws.com/Prod/

Conclusion

Step functions in C# embrace everything you already know about building complex back-end systems. When latency is a concern, an asynchronous process can run in the background and allow APIs and other customer facing solutions to remain responsive.


Wednesday, December 6, 2023

SQL Server: Optimize for Ad Hoc Workloads – use or not use

I used to believe this option was something almost absolute: it should be enabled as a best practice. I was in some ways happy when all my demos for SQL Server 2022 started to fail and I discovered it was because this option was enabled by default.
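
For reference, this is the server-level setting being discussed; you can check and change it with sp_configure (adjust to your own environment before running):

-- Check the current value
EXEC sp_configure 'optimize for ad hoc workloads';

-- Enable it ('show advanced options' must be on to change it)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;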

This weekend I attended a technical session that drew my attention to several blog posts stating the opposite. Here are some of them:

Let’s analyse these scenarios in more detail. But if you are arriving now, you can take a look at my blog about Ad Hoc Server configuration from some years ago.

Analysing the scenarios

Their statements are not wrong in themselves. But there may be scenarios where they apply, and others where they don’t.

The Difference Between Operational and Analytical Data Systems

One of the main claims is that when the same query is sent with different literal values, it is treated as two separate ad hoc queries. The DBA can’t see a single plan for them to make any optimization, which creates a problem.

Let’s analyze this problem from different points of view.

This only happens if the query is sent by tools such as SSMS or similar. If the query is sent by an application, most SQL Server access frameworks parameterize the query.

In production environments, queries from SSMS or similar are ad hoc queries, unless someone is doing something very wrong.

This brings us to an interesting realization:

For production environments this setting may continue to be a best practice, but for analytical environments, not so much.

What is described in the blogs above would happen in analytical environments, but it would be far rarer in production environments.


Let’s consider Some Exceptions

However, after talking with some fellow DBAs, I discovered some exceptions: a company using a custom-developed framework for database access, for example.

First, I would say these exceptions would be at huge risk of suffering SQL injection attacks. Are they really protected?

Let’s say they are. The company took all the precautions needed against this risk. Even so, a custom framework that prevents query parameterization is a big piece of technical debt.

Some could ask: Why? Only because of the bloating of the query plan cache? What if there is enough memory and this is not a big deal for them?

Query Store and the Ad Hoc Queries

The big deal is that they are not using Query Store. They couldn’t even use it, because each query would be recognized as a different one, bloating Query Store, affecting database performance, and delivering nothing in return.

Parameterization Configuration

In order to solve these exceptional scenarios, I would strongly recommend the use of the forced parameterization configuration.

This is a database-level configuration that relaxes the restrictions SQL Server applies when parameterizing queries. When the Parameterization configuration is set to Simple, very few queries are parameterized, usually only simple ones such as searches on the primary key.


When the Parameterization configuration is set to Forced, most queries are parameterized, regardless of how they are sent to the server. The application may not be parameterizing, but the server will.

To change this configuration, you can run the following statement:

ALTER DATABASE AdventureWorks2012 SET PARAMETERIZATION FORCED

There is always the risk of parameterizing something which shouldn’t be parameterized and, as a result, increasing the number of parameter sniffing problems. However, if the applications are already parameterizing the queries, in a regular environment this option will have minimal effect.

On a side note, you can read more about parameter sniffing in my article from some years ago.

However, for the exceptions described, forced parameterization seems to be strongly recommended.

Why? Only because of the query store?

Additional Consequences of not parameterizing

Since Query Store was released in SQL Server 2016, each new version of SQL Server has brought additional features for query optimization, and many of these features are based on Query Store.

SQL Server 2022 was no exception; on the contrary, it brought impressive features which are, in one way or another, dependent on Query Store.


Let’s see what happens with these features in a scenario where the queries are not recognized as parameterized queries.

SQL Server 2022 Optimization Features

DOP Feedback: It uses Query Store history, and it doesn’t work if Query Store is disabled. If the queries are not recognized as the same, this feature will not work.

CE Feedback: In the same way as DOP Feedback, this feature would not work.

PSP: This feature doesn’t depend on query store at all but depends on the query being identified as the same, with different literal values. If this identification fails, this feature will fail as well.

If you would like to know more about these advanced optimization features and how Query Store evolved to them, you can watch the videos of the Malta SQL Server 2022 Query Optimization Conference.

In short, if “Optimize for ad hoc workloads” is not enabled, Query Store will be bloated in the best scenario. In the worst scenario, you will lose all these optimization opportunities, because queries with different literal values will not be recognized as one.

Summary

In summary, if you disable this option, you risk bloating Query Store. If you have reasons that require disabling this option, you will certainly be bloating Query Store and losing the most modern optimizations available.

On the other hand, the points against this option are correct for analytical environments, where we would not expect the same query to repeat many times; different queries will be executed all the time.

Thank you to my friend Alessandro Mortola for the inspiration and help with this blog.



Monday, December 4, 2023

Understanding SQL Join Types

Understanding how to join the data in one table to another is crucial for any data analyst or database developer working with a relational database. Whether you’re a beginner or an experienced SQL user, this article will help you strengthen your SQL skills and become proficient in SQL joins.

With several types of joins available, it can be overwhelming to understand the differences and choose the right one for your query. Each join type is very similar, but different in a few very important ways. In this article, we’ll explore various types of SQL joins using PostgreSQL (though a lot of this will work in most other RDBMSs out there).

There are multiple types of joins that go by very similar names, including inner joins, left outer joins, right outer joins, full outer joins, and cross joins. We will cover all these types, as well as a few join subtypes, including self and natural joins. We’ll explain their functionalities and use cases and provide examples to help you grasp the concepts more effectively.

Prerequisites

To get started with the material in this article, the following will be very helpful.

  • Knowledge of SQL basics
  • Familiarity with SQL tables and their structures.

Note that if you are a novice with PostgreSQL (or any other RDBMS), most of this should be relatively straightforward. In addition, the following skills will be helpful:

  • Basic knowledge of SQL syntax and structure will be helpful.
  • Familiarity with PostgreSQL or similar SQL database management system
  • Access to a PostgreSQL database for practicing SQL queries. How to install PostgreSQL will be covered early in the article.

Setting up the database environment

In this article, we will explore the different types of SQL joins and explain how to use them effectively using PostgreSQL. By the end of this article, you will clearly understand each join type and when to use it. If you want to follow along and try out the concepts presented, there are a few things we need to set up, including the database server, the tools, and a database to query.

Installing PostgreSQL cluster

To follow along, you will need access to a database environment. To set up a PostgreSQL database environment, visit www.postgresql.org, choose the appropriate version for your OS, and follow the installation instructions. Download both the server and its components. The article was written using Postgres 15 as the platform, but during the editorial process, PostgreSQL 16 was released. All of the code will execute on PostgreSQL 16 and other recent versions of PostgreSQL.

For accurate and detailed instructions about the different releases of PostgreSQL, refer to the official documentation on their website. It will provide comprehensive resources for setup and usage.

Installing pgAdmin

Additionally, we will leverage pgAdmin, a robust graphical interface that streamlines the creation, maintenance, and utilization of database objects. It caters to the requirements of both novice and seasoned Postgres users. Follow the download instructions and documentation to set it up properly. (Note that pgAdmin may be installed along with your RDBMS install.)

Installing the Sample Database

To give us tables and data to join, we will use a sample relational database called Northwind. The Northwind database is a well-known and widely used sample database in database management. Created by Microsoft, it serves as a popular tool for demonstrating concepts in database courses, tutorials, and examples. The Northwind database depicts a fictional company known as “Northwind Traders,” which sells various products to customers. It consists of multiple tables representing the company’s operations, including customers, orders, products, suppliers, employees, and categories. These tables are interconnected through relationships, facilitating complex queries and data analysis.

To install Northwind, follow the instructions on Wikiversity. The process is straightforward and includes a script that creates the tables and data without even needing to restore a database. The following diagram is the Entity-Relationship (ER) diagram that shows the tables, columns, and relationships between the tables in the database. As we write some of the code, it can be helpful to see how the tables are related to one another.

Note: in the diagram, the lines between the tables show a relationship between the tables. The solid dot marks the table that is the child in the relationship. The other table is the parent, and its primary key (the column in bold above the line in each table, which is the primary way to identify a row) is repeated in the child table to form the connection we will use in sample queries.

Northwind Database Schema

General Join Principles

Understanding the syntax, the process, and the various join types is crucial for harnessing the full potential of database management systems and allows you to combine data in very interesting ways. In this section, we will delve into joins, exploring their general syntax, the underlying process, handling matching and non-matching rows, and the differences between join types.

A join in a relational database typically involves using SQL to specify the tables to be joined, the join condition, and the desired columns in the result set.

The Process of Joining Tables:

The process of joining tables entails the following steps:

  • Identifying the Tables: Determine the tables you want to join based on the data you need to combine and the relationships between them.
  • Identifying the Join Condition: Identify the column(s) in the tables that have matching values and will serve as the join condition.
  • Choosing the Join Type: Select the appropriate join type based on the desired output, such as an inner join, left join, right join, or full outer join.
  • Writing the Joins: Compose the SQL query using the chosen join type and the specified join condition.
  • Executing the Query: Execute the query against the database and retrieve the joined result set.

When joining tables, some rows will have matching values in the specified join condition, while others will not. What happens to the different values depends on the type of join you use:

  • Matched Rows: Rows with matching values in the join condition will be included in the joined result set. The columns from both tables will be combined based on the join condition.
  • Non-Matched Rows: Rows that do not have matching values in the join condition will be handled differently based on the join type.

As an example, say you have two tables X and Y. We will use these tables for some simple example queries before moving to the Northwind tables.

CREATE TABLE X (
Xid int CONSTRAINT PKX PRIMARY KEY
);
CREATE TABLE Y (
Yid int CONSTRAINT PKY PRIMARY KEY,
Xid int --should be an FK constraint
--but left off for examples.
);

INSERT INTO X (Xid)
VALUES (1),(2),(3);

INSERT INTO Y (Yid, Xid)
VALUES (1,2),(2,3),(3,3),(4,4);

After creating the tables and executing the INSERT statements, the tables have the following data.

Table X:

Xid
1
2
3

Table Y:

Yid Xid
1 2
2 3
3 3
4 4

Since table Y references table X (commonly referred to as being a child of table X), we need to connect the data using the value in the Y table that references the data in table X.

Putting the tables side by side and matching on the Xid column using the join criterion X.Xid = Y.Xid, we can see how the data matches:

X.Xid Y.Yid Y.Xid
1
2     1     2
3     2     3
3     3     3
      4     4

Note that for Xid = 3, two Y rows match one X row, so that data has been duplicated. The queries we will write will connect the two tables in variations of this pattern.

Join Styles

You can specify join criteria in the WHERE or FROM clause of a statement. Although both approaches can produce the same results in certain situations, they differ in functionality and recommended usage.

  • Explicit Join: Using the JOIN keyword to connect the objects to join, you specify the join criteria in an ON clause. This is known as explicit join notation, which uses the ON keyword to specify the predicates or conditions for the join and the JOIN keyword to indicate the tables to be joined.

    This approach explicitly defines how the tables relate, improving query readability and organization. For more complex queries, evaluating the join conditions early can allow the database engine to optimize the query execution plan, potentially leading to enhanced performance.

    Syntax:

SELECT column_names
FROM Table1
    JOIN Table2
     ON Table1.column_name = Table2.column_name
  • Implicit Join: In this notation, the tables to be joined are listed in the FROM clause of the SELECT statement, separated by commas. The join predicates are then specified in the WHERE clause. This method has limitations: it does not support OUTER joins naturally, and if one is not cautious, it can easily produce results equivalent to a CROSS join, especially when the join criteria are missed or mistaken. It can also result in less readable and less optimized queries, particularly when working with large datasets or complex joins.

    Syntax:

SELECT column_names
FROM   Table1, Table2
WHERE  Table1.column_name = Table2.column_name

Relationships

Next, let’s explore SQL relationships essential for comprehending joins. Joins allow us to merge data from multiple tables based on their relationships. Relationships play a pivotal role in organizing and connecting data between different tables.

Relationships facilitate normalization, as they enable you to store related information in separate tables and eliminate duplicated data. Joins allow you to re-combine the data as needed during queries.

They essentially define how data in one table is related to data in another table. The three fundamental types of relationships are one-to-one, one-to-many, and many-to-many. In the diagram, the end with the “crow’s foot” represents the relationship where one or more rows can be related:

Source: Author

  • One-to-One Relationship: In relational database design, a one-to-one (1:1) relationship signifies that a record in Table A is associated with zero or one record in Table B (note that one table will need to be considered the parent in the relationship due to how relationships are implemented.). In this relationship, each record in Table A corresponds to a single record in Table B, and no duplicate values are allowed. One-to-one relationships are typically used when data needs to be separated into multiple tables for normalization or security purposes.
  • One-to-Many Relationship: In relational database design, a one-to-many (1:N) relationship signifies that one instance of Table A can be associated with zero, one, or more instances of Table B, while each instance of Table B can relate to zero or one instance of Table A. One-to-many relationships are widely used to represent hierarchical or parent-child relationships between data (a minimal DDL sketch of this pattern follows this list).
  • Many-to-Many Relationship: In relational database design, a many-to-many (M:N) relationship means multiple records in Table A can associate with multiple records in Table B and vice versa. Specifically, an instance from Table A can relate to zero, one, or more instances of Table B, and similarly, an instance from Table B can link to zero, one, or more instances of Table A.
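As a minimal sketch of how a one-to-many relationship is typically implemented, here are the X and Y tables used later in this article rewritten with the foreign key constraint that the example deliberately leaves off. This is shown only for illustration and is not meant to be executed alongside the original example:

CREATE TABLE X (
    Xid int CONSTRAINT PKX PRIMARY KEY
);

CREATE TABLE Y (
    Yid int CONSTRAINT PKY PRIMARY KEY,
    Xid int CONSTRAINT FKY_X REFERENCES X (Xid)
        --each Y (child) row points to at most one X (parent) row,
        --while one X row can be referenced by many Y rows
);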

When working with joins, it's essential to thoroughly understand the relationships that the designer intended between the tables you are joining. This includes recognizing both matching and non-matching rows or instances between the tables. The presence or absence of multiple matches can influence the outcome of the join operation. Being attentive to these details ensures accurate interpretation of the results and maintains data integrity.

SQL join types play a crucial role in determining the treatment of matching and non-matching rows during the join operation. Let’s explore a concise summary of the distinctions between various join types.

Join Type Overview

In this section I will describe each of the join types, and then in the next section we will work through some examples.

Inner Join

An inner join returns only the rows with matching values in the specified column(s) from both tables. It combines data from two tables based on the common column(s) and eliminates non-matching rows.

Syntax:

SELECT * 
FROM X
INNER JOIN Y
ON X.Xid = Y.Xid;

Using the tables we created earlier, this query will return:

xid yid xid
2 1 2
3 2 3
3 3 3

Looking at the columns, they are based on the table to the left of the join first, then the columns from the right table. Note that xid = 1 from the X table does not show up in the results, and neither does xid = 4 from the Y table.

Left Outer Join:

Left outer join returns all rows from the left table (Table1) and the matching rows from the right table (Table2). Non-matching rows in the right table will contain NULL values in the result set.

Syntax:

SELECT * 
FROM X
LEFT JOIN Y
ON X.Xid = Y.Xid

Now the results are the same, except there is one additional row returned. The X row where xid = 1 will now show up, with NULL values for all of the Y table's columns.

xid yid xid
2 1 2
3 2 3
3 3 3
1    

Right Outer Join

Right outer join returns all rows from the right table (Table2) and the matching rows from the left table (Table1). Non-matching rows in the left table will contain NULL values in the result set. If you change the join to be a RIGHT JOIN:

SELECT * 
FROM X 
RIGHT JOIN Y 
ON X.Xid = Y.Xid;

The output now includes the Y row where xid = 4.

xid yid xid
2 1 2
3 2 3
3 3 3
  4 4

Full Outer Join:

Full outer join returns all rows from both tables, including matching and non-matching rows. 

Syntax:

SELECT * 
FROM X
FULL OUTER JOIN Y
ON X.Xid = Y.Xid;

Now you will see we are back to the version of the output that was shown earlier when the data was presented side by side.

xid yid xid
2 1 2
3 2 3
3 3 3
1    
  4 4

This is very often used to find rows in either table that have no match in the other. In this case, I can find all rows where X.Xid is not in Y and Y.Xid is not in X by adding WHERE X.Xid IS NULL OR Y.Xid IS NULL, as shown below. The output will then be just the last two rows.
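Using the X and Y tables from earlier, the full query looks like this:

SELECT * 
FROM X
FULL OUTER JOIN Y
ON X.Xid = Y.Xid
WHERE X.Xid IS NULL
   OR Y.Xid IS NULL; --keeps only rows with no match on the other side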

Cross Join:

Cross join (also known as Cartesian join) returns the Cartesian product of both tables. It combines each row from the first table with every row from the second table, resulting in a potentially large result set. One of the more common uses of the CROSS JOIN is to add the contents of a single row to a lot of rows. Note that there are no join criteria, and every row in one input matches every row in the other.

Syntax:

SELECT * 
FROM X
CROSS JOIN Y;

The output of this query is a lot longer than the others, because it will return (number of rows in X) * (number of rows in Y) rows; in this case, 3 * 4 or 12 rows.

xid yid xid
1 1 2
1 2 3
1 3 3
1 4 4
2 1 2
2 2 3
2 3 3
2 4 4
3 1 2
3 2 3
3 3 3
3 4 4

Note too that the following will output the same results:

SELECT * 
FROM X,Y;

And this does too, because just like the CROSS JOIN, every row in one table matches every row in the other:

SELECT * 
FROM Y
INNER JOIN X
      ON 1=1; --any criteria that is always true

Join Subtypes

The following are less distinct join operators than variations in syntax and technique, but they are types of joins you can do that are interesting to understand.

Self Join

A self join involves joining a table to itself, treating it as two separate instances. It allows for comparing and combining rows within the same table based on specified column(s).

Syntax:

SELECT * 
FROM X
INNER JOIN X AS T2
ON X.Xid = T2.Xid;

Note that if the column used is the PRIMARY KEY, the query will basically just return each row of the table matched with itself, repeating its columns twice (as you would see if you executed the sample query). This type of query is typically used with things like employee and manager relationships.

For example, if Table1 were Employee, with EmployeeId as the primary key, you might have a column ManagerEmployeeId, and you would execute:

SELECT * 
FROM Employee
INNER JOIN Employee AS Manager 
   ON Employee.ManagerEmployeeId = Manager.EmployeeId

Natural Join

A natural join automatically matches columns with the same name in the two tables being joined. It eliminates the need to specify the join condition explicitly, assuming that column names and datatypes match accurately.

SELECT * 
FROM X
NATURAL JOIN Y;

The output from this query will be just like the INNER JOIN example except for one thing: the column used for the join criterion (or columns, if more than one column shares a name) will not be repeated. So the output will be:

xid yid
2 1
3 2
3 3

 

Note that the example above is an INNER join on the columns that match in X and Y. You can combine NATURAL with the other join types as well (for example, a NATURAL LEFT JOIN); the NATURAL modifier simply derives the join condition from the columns the tables share.
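For instance, a NATURAL LEFT JOIN of the same two tables also keeps the X row with xid = 1, again returning a single xid column (row order may vary):

SELECT * 
FROM X
NATURAL LEFT JOIN Y;

xid yid
2 1
3 2
3 3
1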

Join Examples

Mastering the art of joining tables is a fundamental skill for anyone working with relational databases. By understanding the general syntax, the underlying process, and the differences in join types, you can effectively combine data from multiple tables, extract meaningful insights, and unleash the power of your database management system. With this knowledge, you’ll be well-equipped to navigate the world of generic joins and leverage their capabilities to your advantage.

In this section, we will go through some examples using the Northwind database that we included installation instructions for earlier in the article.

Inner Join

Let's examine what a few of the tables we are going to join together for an INNER JOIN look like. First, let's see what the customers table looks like by returning a few rows from it. Using pgAdmin, start a new query in the Northwind database and execute the following query:

SELECT * 
FROM customers
LIMIT 5;

Query results from PgAdmin, Source Author

This returns 5 rows from the table. For small databases on limited servers, this will almost always return the same data, but be aware that without an ORDER BY clause, there is no guarantee of any order to the output.

Next, let us quickly examine the orders table that we intend to join with the customers table.

SELECT * 
FROM orders
LIMIT 5;

This query will retrieve 5 rows from the orders table and display all columns for each of those rows.

Query results from PgAdmin, Source Author

We will now perform an INNER JOIN between the orders table and the customers table, resulting in the following query:

SELECT c.customername, o.orderid, o.orderdate 
FROM customers c
      INNER JOIN orders o
         ON c.customerid = o.customerid
LIMIT 5;

This query will retrieve a result set of five rows (the LIMIT clause will cut off the output of rows at the number specified) that includes the customer’s name, order ID, and order date for all customer orders in the Customers table. Using an INNER JOIN, only rows with matching customer IDs in the customer and orders tables will be included in the result set.

Query results from PgAdmin, Source Author

In summary, this query retrieves the customer name, order ID, and order date for the first five records where there is a matching customer ID between the customers and orders tables.

Inner joins can be used in various scenarios, from simple to complex queries involving multiple tables. They can also be combined with other SQL keywords, such as WHERE clauses and aggregate functions, to refine the result set further. Understanding how inner joins work can help write efficient and effective queries.

Left Outer Join

The LEFT OUTER JOIN, also known simply as a left join, is a join operation in database management systems that returns all the rows from the left table (Table1) and the matching rows from the right table (Table2). The result will contain NULL values if there is no match in the right table.

To provide more context, we will utilize our Northwind database and leverage the existing customer and orders tables in the following manner:

SELECT c.customerid, c.customername, o.orderid, o.orderdate
FROM customers c
LEFT OUTER JOIN orders o
ON c.customerid = o.customerid;

This query will return a result set that includes the customer ID, customer name, order ID, and order date for all customers, whether they have placed an order or not. A subset of the rows that will be returned are shown here:

Query results, Source Author

If you order by the customername column via the query:

SELECT c.customerid, c.customername, o.orderid, o.orderdate
FROM customers c
LEFT JOIN orders o
   ON c.customerid = o.customerid
ORDER BY c.customername;

You will notice duplicates in the customername column because multiple orders from the same customer are present in the orders table. Since the query performs a LEFT JOIN between the customers and orders tables, it retrieves all rows from the customers table and matches them with corresponding rows from the orders table based on the customerid column. If a customer has placed multiple orders, their customer information (including the customername) will appear in multiple rows in the result set, each row representing a different order placed by that customer. 

Using a LEFT JOIN in the previous query ensures that all records from the customers table will be included in the result set, regardless of whether there are matching records in the orders table. Using a LEFT JOIN, the query combines the data from the left table (customers) with matching records from the right table (orders) based on the specified join condition. If no matching records exist in the right table, NULL values are populated in the result set for the corresponding columns.

The advantage of this approach is that it allows us to retrieve customer data and any associated order information, but it's worth noting that aggregating data over a left join without taking the duplicated rows into account may lead to incorrect results.
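To make that concrete, here is a minimal sketch (using the same customers and orders tables) that counts orders per customer. Counting the o.orderid column rather than using COUNT(*) means a customer with no orders is reported as 0 instead of 1, because the left join still produces one row of NULLs for that customer:

SELECT c.customerid,
       c.customername,
       COUNT(o.orderid) AS order_count --counts only matched order rows
FROM customers c
       LEFT JOIN orders o
          ON c.customerid = o.customerid
GROUP BY c.customerid, c.customername
ORDER BY order_count DESC;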

Even if customers have not placed any orders, their information will still be included in the result set with NULL values in the order-related columns. Overall, the LEFT JOIN ensures that all customers are included in the result set, whether they have placed any orders. It provides a comprehensive view of customer data, incorporating relevant order information where available while maintaining the integrity of the customer records.

There are several benefits to using left outer joins in database management systems:

  • Complete data retrieval from primary table: Left outer joins allow for complete retrieval of data from a primary table even if there is no matching data in a related table. This ensures that all records from the primary table are included in the result set, even if no corresponding data exists in the related table.
  • Improved data analysis: Left outer joins can help identify gaps or missing data in a related table, supporting data quality assessment, improved data analysis, and more informed decision-making.

By including data from the left table that may not have any matching records in the right table, left outer joins enable us to retrieve data from multiple tables without losing rows. Understanding how left outer joins work and their benefits can help us write efficient and effective queries.

Right Outer Join

A Right Outer Join, also known as a Right Join, is a type of SQL join that reverses the roles of the two tables compared to a Left Outer Join. While a Left Outer Join ensures all records from the left (or primary) table are included in the result set, a Right Outer Join ensures all records from the right (or secondary) table are retained, regardless of whether there are matching records in the left table. This reversal makes the Right Outer Join particularly useful when you want to prioritize the data from the right table while incorporating any corresponding data from the left table.
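To mirror the earlier left join example against the Northwind tables, a minimal sketch of a right outer join looks like this; it returns every row from the orders table, with the customer columns set to NULL for any order that has no matching customer row:

SELECT c.customerid, c.customername, o.orderid, o.orderdate
FROM customers c
RIGHT OUTER JOIN orders o
ON c.customerid = o.customerid;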

Full Outer Join

A full outer join in SQL, also known as a full join, is a join operation that merges the output of left and right outer joins to make sure no data is lost in the join. 

For instance, if we have two tables, Table A and Table B, and wish to join them on a common field called "ID", a full outer join will include all rows from both tables, filling in NULL values where there is no corresponding match (and duplicating some of the rows much like we discussed in the Left Outer Join section earlier). To illustrate this concept, we will again use the customers and orders tables.

SELECT c.postalcode, o.orderid
FROM customers c
   FULL OUTER JOIN orders o 
      ON c.customerid=o.customerid
ORDER BY c.customername;

This returns the following:

Query results, Source Author

The FULL OUTER JOIN ensures that all records from both tables are included in the result set, regardless of whether there is a matching record in the other table. This means that the query will return the combination of data from both tables, and any records that do not have corresponding matches in the other table will be included with NULL values in the respective columns. In the result set, we can see that the FULL OUTER JOIN includes all postal codes from the customers table and all order IDs from the orders table, including NULL records from joined fields from both tables.

Note too that there were duplicated postal code values, and that some of the postal code values are the empty string ‘’, which is semantically different from NULL.
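A minimal sketch of one way to separate those cases, reusing the same full outer join; the filter below keeps only the rows where the postal code is either missing entirely (NULL, which includes orders with no matching customer row) or present but blank:

SELECT c.postalcode, o.orderid
FROM customers c
   FULL OUTER JOIN orders o 
      ON c.customerid = o.customerid
WHERE c.postalcode IS NULL  --NULL: no value stored, or no matching customer row for the order
   OR c.postalcode = ''     --empty string: a value is stored, but it is blank
ORDER BY c.customername;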

Full outer joins in database management systems offer several benefits when working with relational data:

  1. Comprehensive data merging: Full outer joins allow you to combine all rows from two related tables, regardless of whether there is a match between the tables. This provides a complete picture of the data in both tables, making it easier to analyze and work with the combined dataset.
  2. Identify data discrepancies: Full outer joins can help identify mismatches or inconsistencies between two tables. When a full outer join is performed, and there is no match for a row from one table, the result will still include that row, but with NULL values for the columns from the other table. These NULL values can indicate missing or inconsistent data, which can be helpful for data validation and cleanup.
  3. Simplify complex queries: A full outer join can sometimes simplify complex queries that would otherwise require multiple steps or a combination of different join types (e.g., left and right outer joins). You can retrieve all the necessary data in a single query using a full outer join, making the code more readable and easier to maintain. For instance, in the example query, we can utilize a Full Outer Join to identify orders without an associated postal code, which can affect the delivery process.
  4. Flexible data analysis: Full outer joins offer flexibility when analyzing data from multiple tables. They enable you to retrieve information from both tables regardless of matching conditions, which can be particularly useful when working with optional relationships or when analyzing data that may not have a direct association.

Cross Join

A cross join, also known as a Cartesian join or Cartesian product, is a join operation in SQL that combines all rows from two tables without any condition. A Cross Join can be considered a specialized type of JOIN where the join condition always evaluates to TRUE, combining all rows from both tables.

Unlike other join operations that depend on specific matching conditions defined in the ON clause, a CROSS JOIN generates all possible combinations of rows from the involved tables. As noted in the overview section, this implies that any type of JOIN can inadvertently lead to a cross join if not written correctly. In other words, each row from the first table is paired with every row from the second table, resulting in a new table containing all possible combinations of rows from the original tables. The number of rows in the result set of a cross join equals the product of the number of rows in the first table and the number of rows in the second table. For example, if we have Table A with m rows and Table B with n rows, a Cross Join will return m * n rows, where each row from Table A is combined with each row from Table B.

Let's perform a cross join between the products and categories tables in the Northwind database. This will give us all the possible combinations of products and categories, even where a product does not belong to a specific category. Although this pairing might not be something you would use as-is in a real-world scenario, it demonstrates the concept of cross joins using the existing products and categories tables.

SELECT p.productid, p.productname, c.categoryid, c.categoryname
FROM "products" p
CROSS JOIN "categories" c;

This returns the following (or at least a subset of what is returned because it is too large for the article):

Query results, Source Author

The CROSS JOIN in this query combines all rows from the products and categories tables, resulting in a Cartesian product of the two tables. This means that the result set will include every possible combination of products and categories, even if the product does not belong to the category. For example, if there are 10 products and 5 categories, the result set will contain 50 rows, one for each combination of a product and a category. Each row will contain the product ID, product name, category ID, and category name. This query is useful for getting a comprehensive overview of all products and categories, regardless of their relation. For example, this query could be used to identify products that could be added to new categories or to identify categories that could be expanded to include new products.

Cross joins in database management systems have specific use cases and benefits when working with relational data:

  • Generate all possible combinations: Cross joins produce a Cartesian product between two tables, creating all possible combinations of rows from the first table with rows from the second table. This can be useful in scenarios where you must explore all possible pairings or scenarios between two data sets.
  • Add one row to all the other rows in a set: For example, you might CROSS JOIN to a single row of computation factors to have those values available in every row of your table (see the sketch after this list).
  • Support for testing and data generation: Cross joins can be helpful in testing and generating sample data for various scenarios. For example, you can use a cross-join to create a dataset with all possible product options and configuration combinations, which can then be used for testing, analysis, or data modeling.
  • Solve complex problems: In some cases, a cross join can simplify complex queries or calculations that would require multiple steps or custom code. For example, you might use a cross join to generate all combinations of input parameters for a query, allowing you to evaluate all possible scenarios in a single pass.
  • Create combinations for decision support: Cross joins can generate all possible combinations of decision variables in decision support systems or optimization problems. This can help identify optimal solutions, evaluate trade-offs, and explore the solution space more effectively.
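As a minimal sketch of the "one row of computation factors" pattern mentioned above, the single-row derived table and its tax_rate value here are hypothetical; the X table is the one created earlier:

SELECT X.Xid,
       f.tax_rate,
       X.Xid * f.tax_rate AS scaled_value --placeholder computation using the factor
FROM X
CROSS JOIN (VALUES (0.07)) AS f (tax_rate); --one-row set of factors, available on every row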

It's important to note that cross joins are not always the most efficient or appropriate choice for every scenario. They can produce many rows in the result set, combining every row from the first table with every row from the second table. This can lead to performance issues or unnecessary complexity in your query. In most real-world scenarios, you would typically use other join types like inner, left, or right joins to retrieve more relevant and related data.

Self Join

A self-join is not a type of join configuration but a technique in SQL where a table is joined to itself, usually using an alias to differentiate between the original table and its copy. A self-join works by treating a table as two separate tables and joining them together. Self-joins are used to establish a relationship between rows within the same table based on a particular condition.

Here’s an example of a self-join using the Northwind database. In this example, we’ll use a self-join on the employees table:

SELECT 
    e1.first_name AS EmployeeFirstName, 
    e1.last_name AS EmployeeLastName, 
    e2.first_name AS ManagerFirstName, 
    e2.last_name AS ManagerLastName
FROM 
    employees e1
LEFT JOIN 
    employees e2 ON e1.reports_to = e2.employee_id
ORDER BY 
    ManagerFirstName, EmployeeFirstName;

This will return the following, again truncated results:

Query results, Source Author

This query retrieves a list of employees along with their respective managers (who they report to) from the employees table. Since we are using a LEFT JOIN, employees without managers will still appear in the result, but with NULL values in the manager columns, as seen above.

Benefits of using self-joins in database management systems:

  • Retrieve hierarchical data: Self-joins are particularly useful when working with hierarchical data, where rows in the table have parent-child relationships. In such cases, self-joins can be used to retrieve the hierarchy or lineage of records.
  • Retrieve indirect relationships: Self-joins can be used to retrieve indirect relationships between rows in the same table, such as finding common connections or shared attributes between records. For example, imagine the Employees table had a SupervisorID column; we could use a self join to identify employees who share a supervisor, as in:
SELECT e1.EmployeeName AS Employee1, 
       e2.EmployeeName AS Employee2, 
       e1.SupervisorID AS CommonSupervisorID 
FROM Employees e1 
     JOIN Employees e2 
      ON e1.SupervisorID = e2.SupervisorID
     AND e1.EmployeeID < e2.EmployeeID --assumes an EmployeeID key; prevents pairing a row
                                       --with itself and listing each pair twice

Natural Join

A natural join is a join operation in SQL that automatically combines two tables based on columns with the same name and data type in both tables. A natural join is a shorthand for joining on columns with the same name.

SELECT c.customername, o.orderid, o.orderdate 
FROM customers c
      NATURAL JOIN orders o

This returns results based on customerid, the column the two tables share.

In a natural join, the database management system identifies columns with matching names and data types in both tables and uses these columns as the basis for the join condition. The natural join eliminates duplicate columns, returning only one copy of each matching column in the result set. It is important to note that natural joins can be risky, as they rely on column names and data types rather than explicitly defined join conditions. If the tables share column names other than the relationship key columns, this can make it impossible to use a NATURAL JOIN correctly.

This can lead to unintended results if column names happen to match across tables without representing a true relationship. Support also varies: PostgreSQL, Oracle, and MySQL accept the NATURAL JOIN keyword, while SQL Server does not and requires you to define the join condition explicitly using an INNER JOIN, LEFT JOIN, or RIGHT JOIN with the ON clause.

In general, it's recommended to use explicit join conditions (e.g., INNER JOIN with the ON clause) instead of relying on natural joins, as this approach ensures that you have full control over the join operation and reduces the likelihood of unintended results (for example, if someone later adds a column to one table whose name matches a column in the other, a NATURAL JOIN in existing code will silently change its behavior).

Combination of Joins

Combining different types of joins can significantly enhance the accuracy, performance, and readability of SQL queries. Let's explore the power of join combinations using an example query that retrieves customers with their orders and, if there are any, the order details:

SELECT c.customerid, c.customername, o.orderid, o.orderdate, p.productname
FROM Customers c
    INNER JOIN Orders o  
       ON c.customerid = o.customerid
    LEFT JOIN OrderDetails od 
       ON o.orderid = od.orderid
    INNER JOIN Products p 
       ON od.productid = p.productid;

This returns the following:

Query results, Source Author

First, an inner join is performed between the Customers and Orders tables using the CustomerID column. This ensures that only customers who have placed orders are included in the result. Next, a left join is applied between the Orders and OrderDetails tables using the OrderID column. On its own, this would allow all orders to be included in the result regardless of whether they have corresponding order details. Finally, an inner join is executed between the OrderDetails and Products tables using the ProductID column. This ensures that only order details with valid product information are included in the result; note that this final inner join also filters back out any orders that had no order details, since their productid values are NULL.

To summarize the query:

  • Customers and Orders tables are joined using an inner join on CustomerID.
  • Orders and OrderDetails tables are joined using a left join on OrderID.
  • OrderDetails and Products tables are joined using an inner join on ProductID.
  • The SELECT clause specifies the columns to be retrieved: CustomerID, CustomerName, OrderID, OrderDate, and ProductName.

The second join actually joins to the set formed by the Customers and Orders tables, so while in most cases you will join on the columns of just one table, it is possible for your join to the OrderDetails table to use Customers columns in its criteria. More detail on that is beyond the scope of this article.

By leveraging this combination of joins, businesses gain access to more comprehensive result sets that foster various analyses, such as order analysis, inventory management, product performance analysis, pricing, and profitability analysis. The retrieved data empowers data-driven decision-making and optimization of operations in a competitive marketplace. The use of join combinations, as demonstrated in this example query, allows businesses to uncover valuable insights and make informed decisions. It enhances the depth and breadth of information retrieved from multiple tables, enabling comprehensive analysis and optimization. By leveraging these combined joins, businesses can unlock hidden patterns, understand customer behavior, and drive success in their operations.

Conclusion

SQL joins are an essential component of database management that can make queries more efficient and productive. Understanding the different types of joins, including the inner, left outer, right outer, full outer, cross, self, natural joins, and combination of joins, can provide a significant advantage to anyone working with data.

By using PostgreSQL, individuals can practice and experiment with different types of joins to enhance their data analysis skills. With the benefits of SQL joins, including increased productivity and improved data accuracy, mastering this skill can make a significant difference in data management. Therefore, we encourage our readers to continue practicing and exploring SQL joins to become proficient in database management.

 
