Tuesday, August 31, 2021

Text Mining and Sentiment Analysis: Oracle Text

The series so far:

  1. Text Mining and Sentiment Analysis: Introduction
  2. Text Mining and Sentiment Analysis: Power BI Visualizations
  3. Text Mining and Sentiment Analysis: Analysis with R
  4. Text Mining and Sentiment Analysis: Oracle Text

The fourth article of the series will focus on leveraging Oracle Text features in Oracle Database to perform sentiment analysis.

The first three articles of this series covered sentiment analysis using tools and services in the Microsoft ecosystem: Azure Cognitive Services and Power BI. This article explores the Oracle Text features in an Oracle Database for performing sentiment analysis. The approach demonstrated in this article is useful in environments where the Oracle platform, tools, and technologies predominate. Several organizations have not fully embraced the public cloud for various reasons, and this approach is suitable when using the public cloud is not feasible in your organization.

The Oracle Text feature has been available since Oracle Database release 12c. It provides indexing, word and theme searching, sentiment analysis, and document classification capabilities. According to Oracle documentation, “Oracle Text can perform linguistic analysis on documents, as well as search text using a variety of strategies including keyword searching, context queries, Boolean operations, pattern matching, mixed thematic queries, HTML/XML section searching, and so on. It can render search results in various formats including unformatted text, HTML with term highlighting, and original document format. Oracle Text supports multiple languages and uses advanced relevance-ranking technology to improve search quality. Oracle Text also offers advanced features like classification, clustering, and support for information visualization metaphors”.

This article will focus on performing sentiment analysis using Oracle Text. Oracle Application Developer’s guide describes this as “Oracle Text enables you to perform sentiment analysis for a topic or document by using sentiment classifiers that are trained to identify sentiment metadata. Sentiment analysis is the process of identifying and extracting sentiment metadata about a specified topic or entity from a set of documents. Trained sentiment classifiers identify the sentiment. When you run a query with sentiment analysis, in addition to the search results, sentiment metadata (positive or negative sentiment) is also identified and displayed”.

In Oracle Text, a sentiment classifier is a specific type of document classifier that is used to extract sentiment metadata. Oracle Text provides a default (unsupervised) sentiment classifier. If you have suitable training data, Oracle Text also gives you the ability to train your own sentiment classifiers.

Please note that this article uses the Oracle Database Application Development virtual machine, running on a Windows host, for the demo.

Initial setup

  1. Download and install VirtualBox by following instructions listed on VirtualBox website.
  2. Download and set up Oracle Database Virtual Box Appliance / Virtual Machine by following instructions listed on the Oracle website.
  3. Start up the Virtual Machine and ensure SQL Developer is available on the desktop screen. Please note that you may need to create a free account to download the Virtual Machine.


Figure 1. Oracle Database Developer Virtual Machine

  4. This article uses the same Excel file of raw data from previous articles, available on my GitHub repository. Please download and save this file at a convenient location on your Virtual Machine.

Set up an Oracle Text User

Set up a user with the CTXAPP role, enabling you to work with Oracle Text indexes and use Oracle Text PL/SQL packages.

  1. On the Virtual Machine, launch the Oracle SQL Developer application. Under the Connections panel, click on the system connection. When prompted for credentials, enter sys as sysdba in the username field and oracle in the password field.
  2. When a new worksheet opens, run the query select * from v$database;. From the query result, verify that the value under the NAME column is ORCLDB. You may also run the query select user from dual; to verify the username of the current connection. It is important to verify this, as the subsequent steps for setting up the Oracle Text user need to be performed as the system administrator. The screenshot below unions the two queries for simplicity of display.


Figure 2. Use Oracle SQL Developer to connect to ORCLDB database as “SYS” user

  3. Run the following SQL statement to create user redgatedemo.
CREATE USER redgatedemo IDENTIFIED BY mydemopassword;

Upon successful execution, you should see the message User REDGATEDEMO created.

  4. Run the following SQL statements to grant user redgatedemo the roles necessary for connecting and using database and CTXAPP resources. The CTXAPP role includes execute privileges on various CTX PL/SQL packages.
GRANT RESOURCE, CONNECT, CTXAPP TO redgatedemo;
GRANT UNLIMITED TABLESPACE TO redgatedemo;

Upon successful execution, you should see the message Grant succeeded.

Import the raw data file into an Oracle table

Click the green plus sign under Connections to open the New / Select Database Connection dialog. Connect to the Oracle Database with user redgatedemo as shown in Figure 3.


Figure 3. Connect to Oracle database with redgatedemo user

  1. Run the following SQL to create a table with a unique index and primary key.
-- Create a Table to load text data
CREATE TABLE REDGATEDEMO.DEMO_SENTIMENTANALYSIS_TEAMHEALTH
(
    ID                     NUMBER GENERATED ALWAYS AS IDENTITY
   ,PERIOD                 VARCHAR2( 500 ) NOT NULL
   ,MANAGER                VARCHAR2( 500 ) NOT NULL
   ,TEAM                   VARCHAR2( 500 ) NOT NULL
-- this will store raw text of survey response
   ,RESPONSE               VARCHAR2( 4000 ) NOT NULL 
   ,DW_CREATION_DATE       DATE DEFAULT SYSDATE
);

-- create a unique index
CREATE UNIQUE INDEX REDGATEDEMO.DEMO_SENTIANALYSIS_TEAMHEALTH_PK
    ON REDGATEDEMO.DEMO_SENTIMENTANALYSIS_TEAMHEALTH(ID )
    ONLINE;

-- create a primary key
ALTER TABLE REDGATEDEMO.DEMO_SENTIMENTANALYSIS_TEAMHEALTH
    ADD CONSTRAINT DEMO_SENTIANALYSIS_TEAMHEALTH_PK PRIMARY KEY
    (ID )
USING INDEX REDGATEDEMO.DEMO_SENTIANALYSIS_TEAMHEALTH_PK ENABLE VALIDATE;


Figure 4. Create table to load text data

  2. If you haven’t already, download the Excel file from the GitHub repo and save it to a convenient location on the Virtual Machine.
  3. In Oracle SQL Developer, under the left-side connection menu, navigate to Oracle Connections > redgatedemo > Tables > DEMO_SENTIMENTANALYSIS_TEAMHEALTH. Right-click on this table and select Import Data.


Figure 5. Launch the Data Import Wizard

  4. This will launch the Data Import Wizard. Use this easy-to-follow five-step wizard to load the previously downloaded Excel file into the Oracle database table REDGATEDEMO.DEMO_SENTIMENTANALYSIS_TEAMHEALTH.


Figure 6. Step 1 of the Data Import Wizard

Please note that Step 3 of the Data Import Wizard is Choose columns. Ensure that you have selected only the four columns Period, Manager, Team, and Response to import. Sometimes you may see a long list of empty Excel columns with headings like column5, column6, etc. under the Available columns section of the wizard; these should not be imported.


Figure 7. Step 3 of the Data Import Wizard (select columns to import)

The wizard will display a success message upon completion.


Figure 8. Data Import wizard completed successfully

Sentiment classifier in Oracle Text

In Oracle Text, a sentiment classifier is a specific type of document classifier that can extract sentiment metadata from a document. To use a sentiment classifier, it is first associated with a sentiment classifier preference, then trained.

Oracle documentation describes a sentiment classifier preference as “It specifies the parameters that are used to train a sentiment classifier. These parameters are defined as attributes of the sentiment classifier preference. You can either create a sentiment classifier preference or use the default”. The default sentiment classifier is CTXSYS.DEFAULT_SENTIMENT_CLASSIFIER. The CTX_DDL.CREATE_PREFERENCE procedure is used to create a user-defined sentiment classifier preference and the CTX_DDL.SET_ATTRIBUTE procedure is used to specify its parameters.
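As a rough sketch, creating a user-defined sentiment classifier preference might look like the following. The preference name my_sentiment_pref is hypothetical, and MAX_FEATURES is shown only as an example of a training attribute; check the attribute list in your release’s documentation before relying on it.

-- Hypothetical sketch: create a user-defined sentiment classifier
-- preference and set one example training attribute.
-- my_sentiment_pref is an arbitrary name; verify MAX_FEATURES
-- against your release's documentation.
exec ctx_ddl.create_preference('my_sentiment_pref','SENTIMENT_CLASSIFIER');
exec ctx_ddl.set_attribute('my_sentiment_pref','MAX_FEATURES','1000');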

Oracle Text provides a default sentiment classifier, a pre-trained out-of-the-box model that works well for most general-purpose sentiment analysis applications. It leverages an AUTO_LEXER that performs language identification, word segmentation, document analysis, part-of-speech tagging, and stemming. In the world of text processing and analysis, a lexer reads an input text and breaks it down into tokens using language-specific grammar rules. The lexer generates a stream of tokens as output. Oracle’s AUTO_LEXER supports several languages and is most commonly used due to its ability to detect the language automatically. The default sentiment classifier is ideal for scenarios where a labeled training dataset is not available and your sentiment classification needs are general purpose in nature (for example, social media posts, Google and Yelp reviews, etc.).

Oracle Text also provides the ability to create your own user-defined sentiment classifiers. To train your own sentiment classifier, a user needs an associated sentiment classifier preference, a training set of documents, and the target sentiment categories. The CTX_CLS.SA_TRAIN procedure is used to train a sentiment classifier. During training, a user assigns each sample document to a category, which Oracle Text uses to infer rules for how to perform sentiment analysis. This technique is an example of supervised machine learning and relies on suitable training data. You may need to train your own sentiment classifier if the default classifier does not meet the needs of your use case (for example, complex text in a domain that needs deep subject matter understanding, like medicine or international law). Training a user-defined sentiment classifier is out of scope for this article. To learn how to train a user-defined sentiment classifier in Oracle Text, please follow this link to Oracle documentation.

Performing Sentiment Analysis using the default classifier

The CTX_DOC PL/SQL package is part of the Oracle Text feature set and provides several functions and procedures for various document services, including sentiment. The following steps demonstrate the use of the CTX_DOC package for performing sentiment analysis using the unsupervised technique (with the default sentiment classifier and sentiment classifier preference).

  1. Run the following SQL statement to create a sentiment classifier preference demoautolexer of type AUTO_LEXER. Oracle Text will use default values for any attributes that are not explicitly defined. Please note that the type AUTO_LEXER should be used only for the default classifier; when training a user-defined classifier, always use the type SENTIMENT_CLASSIFIER.
exec ctx_ddl.create_preference('demoautolexer','AUTO_LEXER');


Figure 9. Create a sentiment classifier preference of type AUTO_LEXER

  2. Run the following SQL statement to create a context index on the document set (table with text data). The context parameters for this index specify the use of AUTO_LEXER and the default stoplist. A stoplist refers to the list of most common words in a particular language, which are usually removed before processing natural language data. This step may take a few minutes to complete.
create index DEMO_SENTIMENTANALYSIS_TEAMHEALTH_TXTIDX 
on DEMO_SENTIMENTANALYSIS_TEAMHEALTH(RESPONSE) indextype is
ctxsys.context parameters 
('lexer demoautolexer stoplist ctxsys.default_stoplist');



Figure 10. Create text index

  3. You can now use this text index in a SELECT statement using the SENTIMENT_AGGREGATE function of the CTX_DOC package. This function takes a document id as input, which in this example is the ID field (the primary key of table DEMO_SENTIMENTANALYSIS_TEAMHEALTH).

Oracle documentation describes the output of the CTX_DOC.SENTIMENT_AGGREGATE procedure as “a single consolidated sentiment score for the document. The sentiment score is a value in the range of -100 to 100, and it indicates the strength of the sentiment. A negative score represents a negative sentiment, and a positive score represents a positive sentiment. Based on the sentiment scores, you can group scores into labels such as Strongly Negative (–80 to –100), Negative (–80 to –50), Neutral (-50 to +50), Positive (+50 to +80), and Strongly Positive (+80 to +100).”

Run the following SQL statement to see the sentiment score for document id = 3 (the row with ID = 3):

SELECT ctx_doc.sentiment_aggregate('DEMO_SENTIMENTANALYSIS_TEAMHEALTH_TXTIDX','3') from dual;


Figure 11. Use ctx_doc.sentiment_aggregate to generate sentiment score for document id = 3

  4. Run the following SQL statement to see the text as well as sentiment scores for document ids 3 and 4.
SELECT TO_CHAR(ID) DOC_ID, RESPONSE FROM DEMO_SENTIMENTANALYSIS_TEAMHEALTH WHERE ID = 3
UNION ALL
select 'Sentiment Score -> ', TO_CHAR(ctx_doc.sentiment_aggregate('DEMO_SENTIMENTANALYSIS_TEAMHEALTH_TXTIDX','3')) from dual
UNION ALL
SELECT TO_CHAR(ID) DOC_ID, RESPONSE FROM DEMO_SENTIMENTANALYSIS_TEAMHEALTH WHERE ID = 4
UNION ALL
select 'Sentiment Score -> ', TO_CHAR(ctx_doc.sentiment_aggregate('DEMO_SENTIMENTANALYSIS_TEAMHEALTH_TXTIDX','4')) from dual


Figure 12. SQL statement to select text and its sentiment score

The results of this SQL statement can be interpreted as:

  1. The sentiment for document id 3, with the text “we work well as a team, we have fun together, I think we are very healthy”, is 31, which falls within the Neutral range of -50 to +50.
  2. The sentiment for document id 4, with the text “I fell pretty good about the health of our team. My main concerns re a lack of vision into what our application’s true purpose is and the future direction of our team.”, is -28, which also falls within the Neutral range of -50 to +50 but is relatively more negative than the sentiment for document id 3. This is confirmed by human interpretation of the text.

A developer would run such a SQL statement per document (per row in the table) using a looping operation and wrapping it in a user-defined PL/SQL package.
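For example, a minimal sketch of such a loop (using the table and index created earlier, with DBMS_OUTPUT for display) could look like this:

DECLARE
    v_score NUMBER;
BEGIN
    -- Loop over every survey response and print its aggregate sentiment score
    FOR r IN (SELECT id
                FROM demo_sentimentanalysis_teamhealth
               ORDER BY id)
    LOOP
        v_score := ctx_doc.sentiment_aggregate(
                       'DEMO_SENTIMENTANALYSIS_TEAMHEALTH_TXTIDX',
                       TO_CHAR(r.id));
        DBMS_OUTPUT.put_line('Document ' || r.id || ' score: ' || v_score);
    END LOOP;
END;
/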

Conclusion

This article demonstrated how to perform sentiment analysis with Oracle Text features available in an Oracle Database, using the default sentiment classifier that comes out of the box in Oracle. I highlighted the potential scenarios where the default sentiment classifier may not fulfill your needs, and you may have to train your own user-defined sentiment classifier. This article also helped readers get familiar with how to write a SQL statement to generate sentiment scores and interpret them to gain meaningful insights.

References:

https://docs.oracle.com/en/database/oracle/oracle-database/18/ccapp/understanding-oracle-text-application-development.html#GUID-CF13C01A-F5E6-4EF5-839B-C09CF0024D5E


The post Text Mining and Sentiment Analysis: Oracle Text appeared first on Simple Talk.




Monday, August 30, 2021

5 SSIS tricks you would like to know

How to make conditional decisions in a package

It may seem strange that the control flow lacks a decision task. However, there is a workaround that allows us to achieve the same effect we would have with a “decision task”.

The precedence control between tasks can be based on an expression. Building the precedence control as an expression allows us to build an execution decision.

Some details about the implementation are important:

  • To achieve the result of an if/else, we may need two precedence constraints built with opposite expressions.
  • You can use a sequence container to ensure the tasks on the if/else path will be completed before the following tasks continue.

How to map expressions in a Data flow

Data flow tasks don’t have the Expressions property to parameterize the properties of the task. This doesn’t mean we can’t parameterize the properties; we just need to do it in a different way.

Instead of looking for Expressions in the data flow task, we should look for it in the control flow task. The Expressions configuration in the control flow task will show us the properties of every task inside the data flow, allowing us to configure expressions for all the properties.


How to use multiple tasks in a package part

Package parts are a great way to reuse pieces of the ETL process, but they have a limitation: Each package part can have a single control flow task.

There is an interesting and easy workaround for this: if we use a container task, such as a sequence container, it still counts as a single task, but inside the container we can add multiple control flow tasks.


How to use a variable across tasks in a package part

All the variables we create in a package part will have the scope of one of the tasks in it.

Package parts don’t have anything similar to a package-level variable. What happens if we need to share values across many tasks in the package part?

The solution is simple and related to the previous problem: create the variables using the scope of the sequence container we mentioned before. In this way, they will be available to all tasks in the sequence container.


How to map the package part connection manager to the package connection manager

Package parts can’t use the connection managers in the solution, they need to have their own connection managers.

When we insert the package part inside a package, we can configure its connection manager and set the connection string.

We need to create a project parameter with the connection string and set all connections to use this parameter, including the package part. In this way we avoid duplicating the connection string everywhere.

The post 5 SSIS tricks you would like to know appeared first on Simple Talk.




Wednesday, August 25, 2021

Machine learning, both exciting and scary

The 2002 movie Minority Report is about a police unit called PreCrime, which can predict when people will commit a crime so they can be arrested before it happens. Things go awry when a team member played by Tom Cruise is himself “pre-accused.”

To make predictions, they use technology and the special abilities of the three cops on the team. While we don’t have the capabilities of the PreCrime unit today, machine learning is being used to make predictions and influence behavior.

The most nefarious example involves a British company called Cambridge Analytica, which improperly used Facebook data to influence the 2016 US presidential election. However, attempts to predict results and manipulate voters with data go back to the 1960s. From movie suggestions on Netflix to auto-correct to scanning faces instead of boarding passes, we can’t get away from artificial intelligence.

While some of this makes our lives more convenient, it feels unnerving thinking about all the collected and analyzed data about each of us. For example, you can see your Google activity on the myactivity page, and you may notice that ads are tailored to things you search and, possibly, what you talk about.

We are in an awkward position in the data platform community. Data science is an exciting and growing field that many of us work in or are training for as the next career move — even though we don’t want to be part of the data collected ourselves. The Data Platform WIT group recently held a Beginning Data Science day, and I have to admit that the topics were compelling.

In the past, data scientists were mostly PhDs with years of training, but now Microsoft and other organizations have “democratized AI.” With drag-and-drop interfaces, it’s easy to set up a machine learning experiment, but will it be accurate if the user doesn’t understand machine learning? On the other hand, does knowing how to do the same thing in Python make a difference if you don’t understand the data or have biases? For example, some face recognition apps only recognize white faces, and hundreds of algorithms for treating Covid-19 are worthless.

Machine learning is here to stay. As the tools become easier to use, it’s essential to ensure that those using them have the proper training and knowledge.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.

The post Machine learning, both exciting and scary appeared first on Simple Talk.




Tuesday, August 24, 2021

A UNIQUE experience

If you look at the postings on SQL forums, you will notice that people are sloppy. They expect the reader to read their minds and figure out everything that has not been posted. They seldom post DDL, and when they do, it is wrong. I don’t mean a little bit wrong. I mean fundamentally wrong. An RDBMS has the concept of (relational) keys, so DDL without at least one key means you don’t have a table at all. If all the columns are NULL-able, they cannot ever be part of any key. This non-table is really a deck of 1960s punch cards disguised as an SQL table.

Dr. Codd defined a key as being a subset of columns in a table. A column is defined as an attribute of the entity that the table models. The IDENTITY table property, left over from sequential file days, is not an attribute of an entity; it is a metadata attribute of the table. The GUID suffers the same problem of not being a proper attribute of the entities within the table in which it appears. Yes, these values are UNIQUE, but they are not attributes. Just sticking the constraint of PRIMARY KEY on them doesn’t avoid the fundamental problem. Novices forget that the “G” in GUID stands for “global,” not local. Such attributes are specifically not in the table! These things cannot be keys by definition.

What is a key?

The definition of the key:

1) A key must be a non-empty subset of columns in the table (attributes of the entity). This means it cannot be an external pointer or physical locator to rows in the table. As an analogy, when you identify an automobile, you don’t use a parking space number in some particular garage; you use the VIN.

2) A key must be unique within the table. This is necessary but not sufficient. In fact, the best keys are unique across the whole schema and then some. This is another reason to use VIN for automobiles; it’s useful to the DMV, your insurance company, and any other use you should have for that automobile. It can be validated by using a really ugly regular expression or verified by physically going to your car and looking at the chassis or the etchings on your windshield. Validation and verification of data is a slightly different topic, so I’ll just skip over them for now.

3) A key must be precise, not approximate or vague. You don’t think about it, but this automatically eliminates approximate numeric types as keys. Two floating-point numbers are considered equal if they differ by less than a small value called the epsilon. This means that two different floating-point numbers can be treated as equal. This need for precision is also the reason that NULLs are not allowed in a key. Remember that a NULL is not a value. I’ll get into some of the problems with NULLs shortly.

A bit of history

When Dr. Codd created the relational model, data was still in file systems and a few early networks and hierarchical databases. Our mindset was still stuck in these pre-relational tools, so the good doctor came up with the idea that every table should have a PRIMARY KEY. It was defined as a key that we anointed as somehow special. A table could only have one of these (Wow! Just like the sort order in sequential files!).

In the early days of the SQL Standards, we considered requiring a PRIMARY KEY on a table but decided against it. The idea was that legacy data without keys (think mag tapes and punch cards) would be easier to move into SQL products without that constraint.

Shortly after that, Dr. Codd realized that a key is a key. There was no need to mark one as special. This is what led to the UNIQUE constraint.

UNIQUE and PRIMARY KEY constraints

The UNIQUE and PRIMARY KEY constraints say that no duplicate values are allowed in the column. It comes in two forms. The syntax is:

<UNIQUE specification> ::= UNIQUE | PRIMARY KEY

There are some subtle differences between UNIQUE and PRIMARY KEY. There still can be only one PRIMARY KEY per table but many UNIQUE columns. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can have a single NULL in it unless you explicitly add a NOT NULL constraint. Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to PRIMARY KEY declarations to document the table and to be sure it stays there when the key changes.
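A minimal sketch of these declarations (the table and column names are hypothetical):

CREATE TABLE Gizmos
(gizmo_id CHAR(10) NOT NULL PRIMARY KEY, -- NOT NULL is implied, but documenting it is good practice
 gizmo_code CHAR(10) NOT NULL UNIQUE, -- a proper relational key
 gizmo_note CHAR(10) UNIQUE); -- may hold a single NULL, per the behavior described above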

File system programmers understand the concept of a PRIMARY KEY, but for the wrong reasons. Their mindset is a sequential file, which can have only one key, because that key is used to determine the physical sort order of the records within the file. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which is another topic.

A single NULL is allowed in the UNIQUE constraint to be able to match a NULL as if it were a value. The same type of logic is applied to the GROUP BY clause so that each NULL doesn’t become its own group. There’s a cute story about this decision that has to do with traffic tickets. Before SQL, someone got a prestige tag reading “MISSING” and was inundated by hundreds of traffic tickets, because that’s what the police officers wrote for a missing auto tag everywhere in California.

Multi-column UNIQUE constraints

There is also a multiple-column form of the <UNIQUE specification>, which is usually written at the end of the column declarations. It is a list of columns in parentheses after the appropriate keyword; it means that the combination of those columns is unique.

Broadly speaking, there are two patterns for multi-column keys. One pattern is like coordinates, where each attribute is strong and has meaning in its own right. For example, I might declare PRIMARY KEY (city, department) so I can be sure that although I have offices in many cities and many identical departments in those offices, there is only one personnel department in Chicago.

The other pattern is hierarchical: one attribute is subordinate to another attribute. A classic example is an order number and an order detail number subordinated to that particular order. This has to do with weak and strong entities, and that’s another topic.
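A hypothetical sketch of the hierarchical pattern, with order details subordinate to their order:

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY);

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
   REFERENCES Orders (order_nbr),
 detail_nbr INTEGER NOT NULL,
 PRIMARY KEY (order_nbr, detail_nbr)); -- detail numbers repeat across orders, not within one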

Super keys

A super key is a key that is too big. It has more columns in it than are needed to identify an entity in a table. In practice, you generally don’t like these things. The indexes or hash tables or whatever your SQL engine uses to implement uniqueness will have to do extra work carrying the redundant attributes. The problem is it isn’t always obvious which attributes are redundant. Imagine a super key UNIQUE (a, b, c, d) in which any three of the four columns are also unique. Which possible combination(s) do you want to keep?

Overlapping keys

You can also have overlapping keys, as in UNIQUE (a, b, c) and UNIQUE (b, c, d), which share columns (b, c).

Nested unique constraints

One of the basic tricks in SQL is representing a one-to-one or many-to-many relationship with a table that references the two (or more) entity tables related by their primary keys. This third table has several popular names such as “junction table,” “Associative Entity,” or “join table,” but we know that it is a relationship. The term “junction table” is a pointer structure from network databases, not part of an RDBMS. For example, given two tables for a Cafeteria database:

CREATE TABLE Entrees
(entree_name VARCHAR(30) NOT NULL PRIMARY KEY
 ...);
CREATE TABLE Sidedishes
(sidedish_name VARCHAR(30) NOT NULL PRIMARY KEY,
 ... );

Yes, I know using names for a key is a bad practice, but it will make my examples easier to read. The simplest relationship table looks like this:

CREATE TABLE Meals
(entree_name VARCHAR(30) NOT NULL
  REFERENCES Entrees (entree_name),
 sidedish_name VARCHAR(30) NOT NULL
  REFERENCES Sidedishes(sidedish_name));

The Meals table allows creating simple one-entree/one-side-dish meals by inserting rows like this:

INSERT INTO Meals
VALUES
('Meat Loaf', 'Mashed Potatoes'),
('Meat Loaf', 'Green Beans'),
('Fried Chicken', 'Green Beans'),
('Meat Loaf', 'Mashed Potatoes');

Oops! ‘Meat Loaf’ is shown twice with ‘Mashed Potatoes’ because the Meals table does not have its own key. This is an easy mistake to make, but fixing it is not obvious.

CREATE TABLE Menu
(entree_name VARCHAR(30) NOT NULL
  REFERENCES Entrees (entree_name),
 sidedish_name VARCHAR(30) NOT NULL
  REFERENCES Sidedishes(sidedish_name),
 PRIMARY KEY (entree_name, sidedish_name)); -- compound key

The Menu table gets rid of the duplicated rows and makes this a proper table. The PRIMARY KEY for the table is made up of two or more columns and is called a compound key because of that fact. These are valid rows now.

('Meat Loaf', 'Mashed Potatoes')
('Meat Loaf', 'Green Beans')
('Fried Chicken', 'Green Beans')

But the only restriction on the meals is that each combination appears only once. Every entree can be paired with every side dish. I want a rule that entrees can have as many side dishes as they want, but each side dish has to stick to one entree.

The way I do this is to use a NOT NULL UNIQUE constraint on the sidedish_name column, which makes it a key. It is a simple key since it is only one column, but it is also a nested key because it appears as a subset of the compound PRIMARY KEY.

CREATE TABLE Plate_Combos
(entree_name VARCHAR(30) NOT NULL
  REFERENCES Entrees (entree_name),
 sidedish_name VARCHAR(30) NOT NULL UNIQUE -- nested key
  REFERENCES Sidedishes(sidedish_name),
 PRIMARY KEY (entree_name, sidedish_name)); -- compound key

Plate_Combos is a proper table, without duplicated rows, and it also enforces the condition that each side dish goes with only one entree, while an entree can have one or more side dishes.

('Meat Loaf', 'Mashed Potatoes')
('Meat Loaf', 'Green Beans')

Instead, you might want the side dishes to be combined with more than one entree.

CREATE TABLE Single_Side_Plates
(entree_name VARCHAR(30) NOT NULL UNIQUE -- nested key
  REFERENCES Entrees (entree_name),
 sidedish_name VARCHAR(30) NOT NULL
  REFERENCES Sidedishes(sidedish_name),
 PRIMARY KEY (entree_name, sidedish_name)); -- compound key

The Single_Side_Plates table would permit these rows from the original set.

('Meat Loaf', 'Green Beans')
('Fried Chicken', 'Green Beans')

Think about all of these possible keys for a minute. The compound PRIMARY KEY is now redundant. If each entree appears only once in the table or each side dish appears only once in the table, then each (entree_name, sidedish_name) pair can appear only once. A rather dull option of a single entree with a single side dish can be enforced with this DDL:

CREATE TABLE Lunch_Plates
(entree_name VARCHAR(30) NOT NULL UNIQUE -- nested key
  REFERENCES Entrees (entree_name),
 sidedish_name VARCHAR(30) NOT NULL UNIQUE -- nested key
  REFERENCES Sidedishes(sidedish_name),
 PRIMARY KEY(entree_name, sidedish_name)); -- redundant compound key

The Lunch_Plates table allows inserting these rows from the original set.

('Meat Loaf', 'Mashed Potatoes')
('Fried Chicken', 'Green Beans')

Making special provisions for the PRIMARY KEY in the SQL engine is not a bad assumption because the REFERENCES clause uses the PRIMARY KEY of the referenced table as the default. Many new SQL programmers are unaware that a FOREIGN KEY constraint can also reference any UNIQUE constraint in the same table or another table. Such things can get very complicated and require that you defer constraints. Again, that’s another topic.
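As a hypothetical illustration, a FOREIGN KEY can reference a UNIQUE column explicitly instead of defaulting to the PRIMARY KEY:

CREATE TABLE Suppliers
(supplier_id INTEGER NOT NULL PRIMARY KEY,
 duns_nbr CHAR(9) NOT NULL UNIQUE); -- an alternate key

CREATE TABLE Shipments
(shipment_id INTEGER NOT NULL PRIMARY KEY,
 duns_nbr CHAR(9) NOT NULL
   REFERENCES Suppliers (duns_nbr)); -- references the UNIQUE constraint, not the primary key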

Overlapping keys

Getting back to the nested keys, just how far can you go with them? My favorite example is a teacher’s schedule kept in a table like this (I am leaving off REFERENCES clauses and CHECK() constraints):

CREATE TABLE Class_Schedule -- skeleton table. WRONG!
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr));

That choice of a PRIMARY KEY is the most obvious one — use all the columns. Typical rows would look like this:

('Mr. Celko', 'Database 101', 222, 6)

The rules to be enforced are:

1) A teacher is in only one room each period.

2) A teacher teaches only one class each period.

3) A room has only one class each period.

4) A room has only one teacher in it each period.

Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list.

CREATE TABLE Class_Schedule_1 -- version one, still WRONG!
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1
 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2
 UNIQUE (class_title, room_nbr, period_nbr), -- rule #3
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #4
 PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr));

There are four ways to pick three things from a set of four things; it is called a combination. I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?

CREATE TABLE Class_Schedule_2 -- still wrong
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1
 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2
 UNIQUE (class_title, room_nbr, period_nbr)); -- rule #3

I can now insert these rows in the second version of the table:

('Mr. Celko', 'Database 101', 222, 6)
 ('Mr. Celko', 'Database 102', 223, 6)

This gives me a very tough sixth-period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:

('Mr. Celko', 'Database 101', 222, 6)
 ('Mr. Celko', 'Database 102', 223, 6)
 ('Ms. Shields', 'Database 101', 223, 6)

Ms. Shields and I are both in room 223, trying to teach different classes at the same time. Matthew Burr looked at the constraints and rules and came up with this analysis.

CREATE TABLE Class_Schedule_3 -- corrected version
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, period_nbr), -- rules #1 and #2
 UNIQUE (room_nbr, period_nbr)); -- rules #3 and #4

If a teacher is in only one room each period, then given a period and a teacher, I should be able to determine only one room, i.e., the room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.

With the constraints provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:

('Mr. Celko', 'Database 101', 222, 6)
 ('Mr. Celko', 'Database 102', 223, 6)

These rows violate rule #1 and rule #2.

However, the UNIQUE constraints first provided in Class_Schedule_2 do not capture this violation and will allow the rows to be entered.

The constraint

UNIQUE (teacher_name, room_nbr, period_nbr)

is checking the complete combination of teacher, room, and period. Since (‘Mr. Celko‘, 222, 6) is different from (‘Mr. Celko‘, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.

UNIQUE (teacher_name, class_title, period_nbr)

doesn’t catch its associated rule either since (‘Mr. Celko‘, ‘Database 101‘, 6) is different from (‘Mr. Celko‘, ‘Database 102‘, 6). Mr. Celko can teach more than one class during the same period, thus violating rule two. It seems that it’s also possible to add the following row:

('Ms. Shields', 'Database 103', 222, 6)

which violates rules #3 and #4.

Conclusion

Try to imagine enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL. You will not always need complicated, nested, overlapping uniqueness constraints, but they can be incredibly powerful when you do.


The post A UNIQUE experience appeared first on Simple Talk.




Monday, August 23, 2021

SQL Server identity column

When designing a table for a database, a column might need to be populated with a different number on every row inserted. An identity column might be a good way to automatically populate a numeric column each time a row is inserted. In this article, I will discuss what a SQL Server identity column is and how it works.

What is a SQL Server identity column?

An identity column is a numeric column in a table that is automatically populated with an integer value each time a row is inserted. Identity columns are often defined as integer columns, but they can also be declared as bigint, smallint, tinyint, or as numeric or decimal as long as the scale is 0. An identity column also cannot be encrypted using a symmetric key but can be encrypted using Transparent Data Encryption (TDE). Additionally, an identity column’s definition must not allow null values. One possible drawback of using an identity column is that only one identity column can be used per table. If more than one numeric field must be populated automatically per table, consider looking at the sequence object, which is outside the scope of this article.

The values automatically generated for each row inserted are based on the seed and an increment property of the identity column. The following syntax is used when defining an identity column:

IDENTITY [ (seed , increment) ]

Seed is the first value loaded into the table, and increment is added to the previous identity value loaded to create each subsequent value. Both seed and increment values need to be supplied together if you wish to override the defaults. If no seed and increment values are provided, then the default values for seed and increment are both 1.

Defining identity column using a CREATE TABLE statement

When a table is designed, most data architects will create the layout, so the first column in the table is the identity column. In reality, this is only a standard practice and not a requirement of an identity column. Any column in a table can be an identity column, but there can only be one identity column per table. Script 1 creates a new table named Widget that contains an identity column.

Script 1: Creating a table with an identity column

CREATE TABLE Widget
(
    WidgetID int identity(1,1) not null,
    WidgetName varchar(100) not null, 
    WidgetDesc varchar(200) not null
);

The WidgetID is the identity column with a seed value of 1 and an increment value of 1.

The seed value determines the identity value for the first row inserted into a table. The increment value is used to determine the identity value of subsequent rows inserted into the table. For each row inserted after the first row, the increment value is added to the current identity value to determine the identity value for the new row being added. The current identity value is an integer value for the identity column of the last row inserted into the table. To see how this works, run Script 2.

Script 2: Code to insert and display three rows added to table Widget

INSERT INTO Widget VALUES 
('thingamajig','A jig you cannot remember'),
('doodad','A hair style you cannot remember'), 
('whatchamacallit', 'A thing for which you cannot remember');
SELECT * FROM Widget;

When the code in Script 2 runs, the output in Report 1 is displayed.

Report 1: Output when Script 2 is run


In Script 2, three rows are inserted into the newly created Widget table. My script only provided column values for the WidgetName and WidgetDesc columns and didn’t provide values for the WidgetID column. The value of the WidgetID column for the first row inserted was based on the seed defined in the CREATE TABLE statement in Script 1. The WidgetID value 2 for the row with a WidgetName of doodad was created by adding the increment value of 1 to the last identity value inserted. The WidgetID value of 3, for the row with the whatchamacallit WidgetName, got its identity value by adding 1 to the identity value used on the second row insert.

Remember that the seed and increment values do not have to be 1 and 1, respectively; they could be whatever values are appropriate for the table. For example, a table could use a seed value of 1000 and an increment of 10, as I have done for the ID column in Script 3.

Script 3: Using different seed and increment value

CREATE TABLE DifferentSeedIncrement
(
    ID int identity(1000,10),
    A varchar(100), 
    B varchar(200)
);

The ID column in Script 3 doesn’t have the NOT NULL property identified in the CREATE TABLE statement, as I had done with the identity column defined in Script 1. The NOT NULL column requirement can be left off because, behind the scenes, the database engine will automatically add the NOT NULL property for any identity column being created.

I’ll leave it up to you to run the code in Script 3 and insert a few rows into the DifferentSeedIncrement table. This way, you can see for yourself the ID values generated for each new row inserted into the DifferentSeedIncrement table and how the table is defined to SQL Server.
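If you do try it, a quick sketch like the following should show ID values of 1000 and 1010:

-- The first row gets the seed (1000); the second gets seed + increment (1010)
INSERT INTO DifferentSeedIncrement (A, B) VALUES
('first row', 'expects ID 1000'),
('second row', 'expects ID 1010');
SELECT * FROM DifferentSeedIncrement;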

Uniqueness of an identity column

Creating an identity column on a table doesn’t mean the identity values will be unique. The reason identity column values might not be unique is that SQL Server allows identity values to be manually inserted, and the seed value can be reset. I will cover both inserting identity values and resetting the seed value in a follow-up article. The SQL Server documentation clearly states that uniqueness must be enforced by using a primary key, unique constraint, or unique index. Therefore, to guarantee that an identity column only contains unique values, one of the aforementioned objects must enforce uniqueness for each value in an identity column.
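As a quick sketch of the first point (using the Widget table from Script 1), manually inserting an identity value that already exists succeeds when nothing enforces uniqueness:

-- Allow explicit values to be inserted into the identity column
SET IDENTITY_INSERT Widget ON;
INSERT INTO Widget (WidgetID, WidgetName, WidgetDesc)
VALUES (1, 'dupe', 'Duplicates the WidgetID of an existing row');
SET IDENTITY_INSERT Widget OFF;
-- Two rows now share WidgetID = 1 because no key enforces uniqueness
SELECT * FROM Widget WHERE WidgetID = 1;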

Identifying identity columns and their definitions in database

There are a number of ways to identify the identity columns and their definitions in a database. One way is to use SQL Server Object Explorer; however, the identity column can’t be determined by just displaying the columns in a table, as shown in Figure 1.


Figure 1: Displaying column definitions for tables created by Script 1 and Script 3

To determine which column is actually an identity column, the column’s properties need to be reviewed. To show the properties of a column, right-click on the column in Object Explorer and then click on the Properties item from the drop-down context menu. Figure 2 shows the properties of the WidgetID column in the Widget table.


Figure 2: Properties of the dbo.Widget.WidgetId column

If the Identity property has a value of True, then the column is an identity column. The seed and increment values are also displayed.

Using the Object Explorer properties method in a database with lots of tables might take a while to determine which columns are identity columns. Another method to display all the identity columns in a database is to use the sys.identity_columns view, as shown in the TSQL code in Script 4.

Script 4: Script to display all identity values in a database

SELECT 
        OBJECT_SCHEMA_NAME(tables.object_id, db_id())
        AS SchemaName,
        tables.name As TableName,
        identity_columns.name as ColumnName,
        identity_columns.seed_value,
        identity_columns.increment_value,
        identity_columns.last_value
FROM sys.tables tables 
        JOIN sys.identity_columns identity_columns 
ON tables.object_id=identity_columns.object_id
GO

Script 4 returns the output in Report 2.

Report 2: Output when script 4 is run


Note that the last_value column for the TableName of DifferentSeedIncrement has a value of NULL. This means no rows have been inserted into this table yet, so the last_value has not been set.

Adding an identity column to an existing table

An existing column cannot be altered to make it an identity column, but a new identity column can be added to an existing table. To show how this can be accomplished, run the code in Script 5. This script creates a new table, adds two rows, and then alters the table to add a new identity column.

Script 5: Adding an identity column

CREATE TABLE Invoices
(
    InvoiceDate date, 
    InvoiceNumber varchar(100),
    PayTo varchar (100)
);
INSERT INTO Invoices VALUES
(getdate(), 'GL_0001', 'Greg Larsen'),
(getdate(), 'GL_0002', 'Greg Larsen');
-- Add Identity Column
ALTER TABLE Invoices  
   ADD InvoiceID int identity;
-- Review Rows
SELECT * FROM Invoices;

The output of Script 5 is shown in Report 3.

Report 3: Rows in Invoices table


Report 3 shows that the new InvoiceID column was added, and the identity values for this column were automatically populated on all existing rows.

Altering an existing table to define an identity column

As already stated, SQL Server does not allow using the ALTER TABLE/ALTER COLUMN command to change an existing column into an identity column directly. However, there are options to modify an existing table column to be an identity column. The following example shows an option that uses a work table to alter a column in an existing table to be an identity column.

To modify an existing column to be an identity column, the script uses the ALTER TABLE … SWITCH command. The SWITCH option was added to the ALTER TABLE statement in SQL Server 2005 as part of the partitioning feature. The TSQL code in Script 6 uses a temporary work table and the SWITCH option to alter an existing column to make it an identity column.

Script 6: Altering an existing column to be an identity column

DROP TABLE Invoices -- clean up from prior example
GO
-- Step 1: Create Invoices table and populate with data
CREATE TABLE Invoices
(   
    InvoiceID int NOT NULL, 
    InvoiceDate date, 
    InvoiceNumber varchar(100),
    PayTo varchar (100)
);
INSERT INTO Invoices VALUES
(1, getdate(), 'GL_0001', 'Greg Larsen'),
(2, getdate(), 'GL_0003', 'Greg Larsen');
-- Step 2: create temporary work table with same schema, but has identity column
CREATE TABLE Invoices2
(   
    InvoiceID int identity(1,1), 
    InvoiceDate date, 
    InvoiceNumber varchar(100),
    PayTo varchar (100)
);
-- Step 3: Switch Tables, drop original, and rename
 ALTER TABLE Invoices SWITCH TO Invoices2;
 -- drop original table
 DROP TABLE Invoices;
 -- Rename temp table back to original table name
 EXEC sp_rename 'Invoices2','Invoices';  
-- Step 4: Update the current seed value for new Invoices table
 DBCC CHECKIDENT('Invoices');

Script 6 went through four steps to alter an existing column to be an identity column. Below are some things to consider when using this method to add an identity column to an existing table:

To use the SWITCH option on the ALTER TABLE statement, the column being changed to an identity column on the original table must not allow nulls. If it allows null, then the switch operations will fail.

Make sure to reseed the identity column of the new table using the DBCC CHECKIDENT command. If this is not done, then the next row inserted will use the original seed value, and duplicate identity values could be created if there is no primary key, unique constraint, or unique index on the identity column.

All foreign keys will need to be dropped prior to running the ALTER TABLE …SWITCH command.

If indexes exist on the original table, then the temporary table will also need the exact same indexes, or the switch operation will fail.

While the ALTER TABLE …SWITCH command is running, there must be no transactions running against the table. All new transactions will be prevented from starting while the switch operation is being performed.

When switching tables, security permissions could be lost because the security permissions are associated with the target table when a switch operation is performed. Therefore, make sure permissions of the original table are recreated on the target table either before or shortly after the switch operation.

Reseeding an identity column

In the previous example, I reseeded the identity column value by using the DBCC CHECKIDENT statement. There are other reasons why an identity column value might need to be reseeded, like when several rows were incorrectly inserted into a table, or erroneous rows were deleted. Mistakenly inserting erroneous rows causes the current identity to be increased for each row added. Therefore, after all the bad rows have been deleted, the next row will use the next identity value and leave a large gap in identity values. If this mistake has been made, then reseeding the identity value ensures there isn’t a big gap of missing identity values.

The DBCC CHECKIDENT command is used to reseed an identity value for a table. This command has the following syntax:

DBCC CHECKIDENT
 (
    table_name  
        [, { NORESEED | { RESEED [, new_reseed_value ] } } ]  
)  
[ WITH NO_INFOMSGS ]

The table_name parameter is the name of the table that contains an identity specification. The table must contain an identity column; otherwise, an error will occur when the DBCC CHECKIDENT command is run. If no other options are added along with this command, then the current identity value will be reset to the maximum value found in the existing identity column.

The NORESEED option specifies not to change the seed value. This option is useful to determine the current and maximum identity value. If the current and maximum values are different, then the identity value should be reseeded.
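For example, this statement (using the Invoices table from Script 6) reports the current identity value and the maximum column value without changing either:

-- Compare the current identity value to the maximum value in the column
DBCC CHECKIDENT('Invoices', NORESEED);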

When the current identity value is less than the maximum, or there is a large gap in the identity values, the RESEED option can be used to reset the current identity value. The RESEED option can be specified with or without a new_reseed_value. When no new_reseed_value is specified, the current identity value will be set to the maximum value stored in the identity column of the table specified.

Script 6 shows how to reseed an identity column value using the DBCC CHECKIDENT command without the RESEED option. The TSQL code in Script 7 shows how to set the current seed value to 2 using the RESEED option.

Script 7: Using the RESEED option

DBCC CHECKIDENT('Invoices',RESEED,2);

Be careful when using the RESEED option with a new seed value. SQL Server doesn’t care what value is used for the new seed. If the new seed is set to a value less than the maximum identity value in the table, duplicate identity values might be created.

The SQL Server identity column

An identity column will automatically generate and populate a numeric column value each time a new row is inserted into a table. The identity column uses the current seed value along with an increment value to generate a new identity value for each row inserted. This article only covered some fundamental aspects of using the identity column. In a future article, more nuances of the identity columns will be explored.

If you liked this article, you might also like Improving performance with instant file initialization.


The post SQL Server identity column appeared first on Simple Talk.




Change Delete Behavior and more on EF Core

Entity Framework had objects called conventions. These objects could be used to configure the Entity Framework context and change its default behavior.

Two examples of behaviors we usually configure are the pluralizing behavior for object names and cascade delete for foreign keys.

Entity Framework Core, on the other hand, doesn’t have conventions. How can we achieve the same configuration in Entity Framework Core?

Configuring Conventions in Entity Framework Core

Instead of depending on conventions, we need to make the configurations in a “manual” way. It means setting the configuration on each object of the schema as needed.

We may need a bit more code, and we also need to take care to execute this code after the definition of the schema. In this way, the behaviors we are configuring will apply to all the objects in the schema.

This is an example of the OnModelCreating applying the code to avoid the pluralize behavior and remove the delete cascade behavior from all foreign keys:


        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            if (modelBuilder == null)
                throw new ArgumentNullException("modelBuilder");

            // for the other conventions, we do a metadata model loop
            foreach (var entityType in modelBuilder.Model.GetEntityTypes())
            {
                // equivalent of modelBuilder.Conventions.Remove<PluralizingTableNameConvention>();
                entityType.SetTableName(entityType.DisplayName());

                // equivalent of modelBuilder.Conventions.Remove<OneToManyCascadeDeleteConvention>();
                entityType.GetForeignKeys()
                    .Where(fk => !fk.IsOwnership && fk.DeleteBehavior == DeleteBehavior.Cascade)
                    .ToList()
                    .ForEach(fk => fk.DeleteBehavior = DeleteBehavior.Restrict);
            }

            base.OnModelCreating(modelBuilder);
        }


Creating Extension Methods

We can do a bit better, making this code more reusable with extension methods. Our first attempt at extension methods can look like this:


public static class ContextExtensions
{
    public static void RemovePluralizeBehavior(this ModelBuilder builder)
    {
        builder.EntityLoop(et => et.SetTableName(et.DisplayName()));
    }

    public static void RemoveOneToManyCascade(this ModelBuilder builder)
    {
        builder.EntityLoop(et => et.GetForeignKeys()
            .Where(fk => !fk.IsOwnership && fk.DeleteBehavior == DeleteBehavior.Cascade)
            .ToList()
            .ForEach(fk => fk.DeleteBehavior = DeleteBehavior.Restrict));
    }


    private static void EntityLoop(this ModelBuilder builder, Action<IMutableEntityType> action)
    {
        foreach (var entityType in builder.Model.GetEntityTypes())
        {
            action(entityType);
        }
    }
}


Notice the interesting way I ensured the loop is written only once and reused, parameterizing it with Action/lambda expressions.

Once we have built the extension methods, the OnModelCreating method becomes this:


        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            if (modelBuilder == null)
                throw new ArgumentNullException("modelBuilder");

            modelBuilder.RemovePluralizeBehavior();
            modelBuilder.RemoveOneToManyCascade();

            base.OnModelCreating(modelBuilder);
        }

Optimizing the Extension

This code could still be improved because, as it stands, the same entity loop is executed twice (once for each convention). We can make some changes to our extension to avoid this. The new code will be like this:

 

public static class ContextExtensions
{
    // actions registered to run in a single pass over the model
    // (note: a static list is shared across contexts and is not thread-safe)
    private static List<Action<IMutableEntityType>> Conventions = new List<Action<IMutableEntityType>>();

    public static void AddRemovePluralizeConvention(this ModelBuilder builder)
    {
        Conventions.Add(et => et.SetTableName(et.DisplayName()));
    }

    public static void AddRemoveOneToManyCascadeConvention(this ModelBuilder builder)
    {
        Conventions.Add(et => et.GetForeignKeys()
            .Where(fk => !fk.IsOwnership && fk.DeleteBehavior == DeleteBehavior.Cascade)
            .ToList()
            .ForEach(fk => fk.DeleteBehavior = DeleteBehavior.Restrict));
    }

    // a single loop over the entity types, applying every registered action
    public static void ApplyConventions(this ModelBuilder builder)
    {
        foreach (var entityType in builder.Model.GetEntityTypes())
        {
            foreach (Action<IMutableEntityType> action in Conventions)
                action(entityType);
        }

        Conventions.Clear();
    }
}

The new OnModelCreating will be like this:

 

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            if (modelBuilder == null)
                throw new ArgumentNullException("modelBuilder");

            modelBuilder.AddRemovePluralizeConvention();
            modelBuilder.AddRemoveOneToManyCascadeConvention();

            modelBuilder.ApplyConventions();

            base.OnModelCreating(modelBuilder);
        }

Conclusion

.NET Core leaves out some pre-built classes, such as the convention types, but we can still create our own libraries to reuse and simplify the code in our own way.

I have made these extension methods available on GitHub; you can use, fork, improve, and include them in your own projects.

The post Change Delete Behavior and more on EF Core appeared first on Simple Talk.




Friday, August 20, 2021

Oracle sequences – 12c features including “identity”

The series so far:

  1. Oracle sequences: The basics
  2. Oracle sequences: RAC
  3. Typical uses of Oracle sequences

The final installment of this mini-series on sequences looks at the new features relating to sequences that appeared in the 12c timeline in response, possibly, to some of the more frequently voiced complaints of users and developers. In particular, the article shows how it is now possible to use synthetic keys for a table through a sequence that is implicitly created and tied to that table by a static declaration.

Feature round-up

Since Oracle 12.2, and all the way up to 21c, the complete syntax of the create sequence command (excluding only the details relating to creating sequences that can be shared from an application root to its application PDBs) has been as follows:

Figure: the complete syntax of the create sequence command

There is a new option for altering a sequence that appeared in 12.1 but was not documented until 18c:

alter sequence [schema_name.]{sequence_name} restart;

The significant new feature that goes along with sequences, though, is that you can (at last!) define one column in a table to be an identity column. Having an identity column in a table means a sequence will be created and linked to the column behind the scenes so that inserts into the table will automatically use the sequence nextval to populate that column when needed.

Basic options

Of the four newer options highlighted above for create sequence, I’ve covered the scale/extend option in an earlier installment discussing RAC. If you specify scale, the sequence generator uses six leading digits to include the instance and session information. Unless you include the extend option, these six digits will be considered as contributing to your stated maxvalue – which means your effective range of values has been reduced by a factor of a million.

The shard option does for sharded databases what the scale option does for RAC instances. It uses the four leading digits of the generated value for the shard id – presumably to allow you to avoid cross-shard updates taking place when sequence values are used in indexes. As with scale, shard can include the extend option, but if you use both scale and shard, then extend need only appear once in the sequence declaration.
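
For reference, here is a sketch of the two declarations (the sequence names are mine, and I have included extend so that the extra leading digits don't reduce the effective maxvalue range):

create sequence s_scale cache 1000 scale extend;   -- 6 leading digits: instance/session
create sequence s_shard cache 1000 shard extend;   -- 4 leading digits: shard id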

It’s worth noting that the manuals advise you against using the order option if you’ve used the scale or shard options. Since their purpose is to minimize cross instance and cross shard calls, this makes perfect sense as the order option would result in the instances/shards constantly communicating with each other to pass the current high value around.

The keep option is a detail you need to know about if you’re implementing Oracle’s Application Continuity features. Since nextval (like sysdate, systimestamp, and a couple of others) is not a transactional value, Oracle needs to do something special to ensure that transactions can be replayed correctly in the event of an outage. To allow this to happen, the sequence needs to be declared with the keep option, and application users need the keep sequence privilege, granted through a command like:

grant keep sequence on {sequence name} to {user name};
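
The matching declaration on the sequence itself is simple – a sketch (the name s_keep is mine):

create sequence s_keep cache 1000 keep;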

Finally, there is the global/session option. The default is global, which behaves in the way that create sequence has always behaved. The session option means the sequence will be a “temporary” sequence (rather like a global temporary table). When a session selects nextval from a session sequence, it will create a private copy of the sequence from the sequence metadata and work only with that private copy. If multiple sessions are using the same session sequence, they will all have their own private copies and all start at the same starting value without interfering with each other.
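
A minimal sketch (the sequence name is mine) showing the “private copy” behaviour:

create sequence s_session start with 1 session;

-- session 1:
select s_session.nextval from dual;    -- returns 1
-- session 2, concurrently:
select s_session.nextval from dual;    -- also returns 1: each session has its own copy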

There are several benefits to session sequences when it’s appropriate to use them. You no longer have to worry about updates to the seq$ table (which is particularly convenient for RAC systems), and you can now use sequences on read-only databases (provided the database was read-write at some point so that you could create the sequence).

This feature may be particularly relevant to systems using Active Data Guard. With a suitable sequence definition at the primary database, a standby database can request a range (dictated by the cache size) of values from the primary that will not duplicate values that will be used by the primary. If you can define the sequence as a session sequence at the primary database, the standby no longer has to access the primary but starts by accessing its local definition. Taking this further: with a global temporary table (originally defined at the primary, of course), and with temp_undo_enabled set to true (which is the default for ADG standbys), you can even execute DML like the following on GTTs:

insert into my_gtt1 select session_seq1.nextval, {other columns} from …

I have to admit that I haven’t set up a system to look for side effects and unexpected overheads in this approach – so do test it and examine the session events and session activity stats very carefully before taking it to production.

Another benefit of the session-private sequence is that since you are no longer competing for a public sequence cache, the latch contention due to heavy usage of a very popular sequence will be reduced. I was a little disappointed, however, when I tested the feature on 19.11.0.0: a global sequence needed 3 latch gets per call to nextval while a session sequence still needed 2 latch gets – an improvement, but not as much of an improvement as I had assumed that I would see. Possibly the latch gets are currently needed to check that the sequence definition hasn’t changed since the last call.

A detail to remember when you define a session sequence is that some standard sequence options (cache/nocache, order/noorder) are irrelevant and will be ignored.

Restart

One of the FAQs about sequences used to be “how can I reset the sequence?” The answer was usually something along the lines of

  1. work out the difference between the current value and the start with value
  2. alter the increment by to the negative of that difference
  3. select {sequence_name}.nextval from dual
  4. alter the increment by back to its original value
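
In SQL, the workaround looked something like this sketch, assuming a sequence s1 created with start with 1 whose last used value is 1000:

alter sequence s1 increment by -999;    -- 1000 - 1 = 999, negated
select s1.nextval from dual;            -- steps the sequence back to 1
alter sequence s1 increment by 1;       -- restore the original increment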

This method was a little risky since someone else might manage to slip in a select nextval in the brief gap between steps 2 and 4, leaving you with a sequence that was nowhere near where you wanted it to be. In 12.1.0.2, one of the options in the alter sequence command is to restart.

Unfortunately, if you try this on a base-level 12.1.0.2, you will find that Oracle raises an error:

SQL> alter sequence s1 restart;
alter sequence s1 restart
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01407: cannot update ("SYS"."SEQ$"."HIGHWATER") to NULL

The dictionary table for sequences (seq$) has a column highwater declared as NOT NULL, and it’s easy to see from a trace file that the internal code to restart a sequence doesn’t supply a value for this column in its call to update the seq$ table. This is probably bug 19602919, for which a patch is available on several platforms, but the bug is unpublished, so I haven’t been able to confirm the details.

There are a couple of little traps to the restart option. First, if you restart a session sequence, you will find that every session that is currently using that sequence will immediately see the effects of the restart; the restart is not restricted to just your own session. I’d like to see the restart made completely private, but since it’s just one of many options in the alter sequence command, I can understand why it might be difficult to make a special case for it. Of course, for similar reasons, you can’t restart a sequence in a read-only database – Oracle will raise the error you might expect: ORA-16000: database or pluggable database open for read-only access.

The second trap appears with incomplete statements of action. Here’s an example (run on 19.11.0.0):

drop sequence s1;
create sequence s1 
        start with 100 
        increment by 2
        cache 1000 
        minvalue 10 
        maxvalue 1e6
;
select s1.nextval from dual;
select s1.nextval from dual;
alter sequence s1 restart;
select s1.nextval from dual;
select s1.nextval from dual;

What values do you think the four calls for s1.nextval will produce? You may be expecting 100, 102, 100, 102. However, it’s actually 100, 102, 10, 12. Oracle doesn’t remember the original start with clause; it restarts at your minvalue unless you explicitly include a start with value in the call to restart. Perhaps you guessed correctly what would happen because I supplied a hint by explicitly setting the minvalue to a non-default value. If I hadn’t done that, would you have guessed that Oracle would restart at the default minvalue of 1?
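
To restart at the original starting point, you have to say so explicitly – a sketch:

alter sequence s1 restart start with 100;
select s1.nextval from dual;            -- returns 100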

Identity

One of the questions that used to appear regularly on Oracle forums was: “How do I find out which sequence is being used to generate the key for this table?” The most typical answer was that there was no “tight” connection between sequences and primary keys – you create a sequence, you write code (often seen in a trigger) to select from the sequence and populate the column.

After many years Oracle allowed schema_name.sequence_name.nextval to be used as a default value for a column, making the connection much tighter. This didn’t stop anyone from using the nominated sequence for other reasons, even to the extent of making it the default value for columns in several other tables.

Finally, in 12cR1, Oracle introduced the “identity” mechanism, which sets up a rigid link between a column and a system-generated sequence. There can be only one identity column in a table, and the sequence cannot be accessed in any way other than Oracle’s internal access to populate the column. The one identity per table rule is enforced by the simple implementation detail that the name of the sequence includes the object_id of the table, in the format ISEQ$$_{table_object_id}.
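
This also means the dictionary records the link. As a sketch of a lookup – I believe the 12c view user_tab_identity_cols reports the supporting sequence name:

select table_name, column_name, generation_type, sequence_name
from   user_tab_identity_cols
where  table_name = 'T';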

The identity column doesn’t conform 100% to expectations, however. It doesn’t automatically get declared unique, though it automatically acquires a not null declaration and constraint. If you want the column to be used as the primary key, you have to declare it as such.
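
For example – a sketch (the table name t2 is mine):

create table t2 (
        id      number(12,0) generated always as identity primary key,
        padding varchar2(40)
);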

There are three variants on adding an identity column to a table – shown below in a single statement with two of the variants commented out:

create table t (n1 number(6,0));
alter table t
        add ident number(12,0)
                generated
                        always
                        -- by default
                        -- by default on null
                as identity 
                cache 1000 start with 10 maxvalue 1e6
;

The example creates a table with a single numeric (integer) column called n1. It adds a column called ident, which I’ve defined as a numeric (integer) column and declared as an identity column. Although I’m not allowed to name the sequence that will supply values for the column, I’ve shown that I’m allowed to supply standard sequence options that dictate how the values will be generated.

To explain how the three identity variants (always, by default, by default on null) work, I can insert three carefully constructed rows into the three versions of the table and see what the results look like when I select them back. Here are the three insert statements:

insert into t(n1) values (1);
insert into t (n1, ident) values (2,null);
insert into t (n1, ident) values (3,101);
select * from t;

In the first case, I haven’t inserted the ident column at all.

In the second case, I’ve inserted an explicit null into the ident column.

In the last case, I’ve inserted a non-null value into the ident column.

Here are the three sets of effects from the three separate tests (with a little cosmetic editing for ease of reading):

Generated always

1 row created.
ORA-32795: cannot insert into a generated always identity column
ORA-32795: cannot insert into a generated always identity column
        N1      IDENT
---------- ----------
         1         10

In this first case, two of the inserts fail because I’ve supplied an input (albeit that one of them is an explicit null) for the ident column. The first of the three inserts didn’t try to supply anything at all for ident and has used the sequence to generate the value 10.

Generated by default

1 row created.
ORA-01400: cannot insert NULL into ("TEST_USER"."T"."IDENT")
1 row created.
        N1      IDENT
---------- ----------
         1         10
         3        101

In the second case, Oracle used the sequence to generate the value 10 when I didn’t include ident in my insert statement, and it accepted the actual value I’d supplied for ident for the third statement. However, when I tried to insert a row where the ident column was supplied with an explicit null, Oracle rejected it. The column acquired a not null declaration as it became an identity, so I can’t insert a null, but Oracle is only allowed to use the default value if the insert hasn’t attempted to supply anything at all for the column.

Generated by default on null

1 row created.
1 row created.
1 row created.
        N1      IDENT
---------- ----------
         1         10
         2         11
         3        101

Finally, I have a version where all three inserts are acceptable. The insert which doesn’t supply an input for ident is supplied with a value by the sequence. The insert with the explicit null supplied for ident is overwritten by a value supplied by the sequence, and the insert with a non-null ident accepts that value.

There are other restrictions on the ident column that vary with the identity declaration. You can’t update the identity column if it is declared generated always, but you can update it if it’s generated by default or generated by default on null, provided you don’t try to update it to null.
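
A quick sketch against the by default version of the table from above (the exact error raised may vary with version):

update t set ident = 20   where n1 = 3;    -- allowed for by default / by default on null
update t set ident = null where n1 = 3;    -- fails: the column has a not null constraint
-- against the "generated always" variant, both statements raise an error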

It’s nice to have a choice of strategies, and the best choice for your application may depend on the style of code that has already been written. You may also decide that you want to change an existing application that uses an older mechanism (such as a simple default sequence.nextval declaration) to take advantage of the new identity mechanism. This might mean you have to convert your existing data using one mechanism, then go into production using another – e.g., conversion using by default on null then changing to always when the data is complete.
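
A sketch of that final switch, assuming (as I believe is the case) that the generation type of an existing identity column can be changed in place:

-- the column was added as "generated by default on null as identity" for the conversion;
-- once the data is complete, tighten the definition:
alter table t modify ident generated always as identity;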

Add/Drop/Modify identity

Despite the heading, I’m going to start with dropping an identity because it’s the easy option:

alter table t modify ident drop identity;

That’s it – the sequence is dropped. The not null declaration and constraint disappear, but the column is still there. Big warning, though: if you do this, you can’t just put it back again. You can’t modify an existing column to become an identity; you can only add a new identity column.

This restriction seems like a bit of a show-stopper to me (and I may be wrong), but it looks as if you can’t take a table with a user-coded “identity-like” column and tell Oracle “I want this column to become an Oracle-managed identity.”

You’ve already seen the example of adding an identity to a table. However, in my original example, I created a table and added the identity column while it was still empty, and that hides a threat. If you want to add an identity column to a table with existing data, then Oracle will lock the table before using the generated sequence to update every row. This may lead to some rows being migrated, of course, especially if you’ve used the scale extend option.
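
A sketch of what that looks like – bear in mind that on a large table this single DDL statement will lock the (hypothetical) table big_t and update every row:

alter table big_t
        add id number(12,0)
                generated always
                as identity
                cache 1000
;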

It is possible to modify the definition of an identity – mainly to allow you to correct errors in the way you’ve defined the underlying sequence: a larger cache size is the most likely requirement. Imagine, however, that you’ve used by default or by default on null in your identity definition. This would have allowed your application to insert values greater than the current value of the underlying sequence.

Now consider the possibility that you’ve created a primary key index on that column. What are you going to do when the inevitable collision arrives, and the supporting sequence reaches a value that has previously been manually entered? You’re going to get a “duplicate key” error and will need to adjust the sequence to bypass the value. Though it’s not documented in the earlier manuals, there’s a special case for start with when a sequence is associated with an identity column:

alter table t modify ident generated as identity start with limit value;

The special case of limit value tells Oracle to scan the table for the largest value in the column and jump the sequence to the next value above that.

Oracle 12c sequences

This final article on sequences has provided a brief tour of the scale, shard, keep, and session options for sequences. The scale option was covered in more detail in an earlier article, and the shard and keep options are likely to be used by only a small set of more sophisticated users. The session option for creating “private” sequences that can be used in read-only databases is the one most likely to be of general use.

I’ve highlighted the appearance (after a very long wait) of the restart option for sequences and have highlighted a couple of little traps with its use: where will a sequence restart from, and what side effects will a restart have for a session sequence that is currently in use in multiple sessions.

Finally, the article covered the identity column declaration and how it depends on a system-generated sequence, then used a simple example to show the effects of the three different definitions on data input. Two of the definitions allow for data creation that could cause a primary key (or unique) constraint to raise an exception, so the article explained the special option for modifying the sequence to jump above the boundary value in the column. It also noted that it’s easy to remove the identity mechanism from a column, but a lot more work to add the identity mechanism to an existing data set.

 

The post Oracle sequences – 12c features including “identity” appeared first on Simple Talk.


