Friday, December 29, 2023

Understanding the Importance of Data Lineage in Modern Data Management

Data lineage is a component of modern data management that helps organizations understand the origins, transformations, and movement of their data. It is like a road map that shows us where our data has been, how it has changed, and where it is going, just like tracking the journey of a package: from the person who sent it (the source) to the places it passes through, and finally to the person who receives it.

The concept of data lineage has been around for many years, but it has become increasingly important in recent years due to the growth of big data and the increasing complexity of data processing systems.

One of the earliest precursors of data lineage thinking can be found in the work of Dr. Donald Knuth, who developed the TeX typesetting system in the 1970s. TeX uses a complex set of rules to generate high-quality printed output, and Knuth also created a companion system, Metafont, for describing how fonts are generated from precise specifications. Carefully documenting how data flowed through these rules allowed Knuth to identify and fix errors in his typesetting systems.

In the 1980s, data lineage began to be used in the financial services industry to track the flow of financial transactions, driven by the industry's growing regulatory and audit requirements; that pressure only intensified with later legislation such as the Gramm-Leach-Bliley Act of 1999.

In the 1990s, data lineage began to be used in the healthcare industry to track the flow of patient data. This was due to the increasing need for healthcare providers to comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

In today’s data-driven world, organizations are collecting and storing more data than ever before. This data is important for making informed business decisions, fueling innovation, and enabling organizations to better understand their customers and markets.

Data lineage provides a detailed record of data’s path, showing how it is created, modified, used, and transformed throughout various processes within an organization. It helps stakeholders understand the data’s quality, origin, and business relevance. It can be represented in graphical or tabular formats, allowing stakeholders to comprehend complex data relationships and dependencies easily.

Data lineage has three main components that help us understand how it works. They include:

  • Data Source: The journey of data commences at a data source. Data sources serve as the birthplaces or points of origin where data is either generated or acquired. These sources encompass various forms, including databases, spreadsheets, files, APIs (application programming interfaces) and even streaming platforms. To illustrate, when a customer places an order on an e-commerce website, the details of that order constitute the data generated at the source, which in this case is the website’s database. Data sources are important for understanding the context and quality of the data, essentially laying the foundation for subsequent stages in the data journey. At this initial stage, data can exist in its raw, unprocessed form, or it may have already undergone some level of transformation. However, while databases, spreadsheets, and files themselves can’t be considered the absolute start of data (since data inherently represents something), they can indeed serve as the source of data within a specific system’s perspective.
  • Data Transformation and Processes: The journey of data is rarely a direct path from source to destination. Data often undergoes transformations and processes, changing its structure, format, or even meaning. These transformations can be as simple as converting data from one format to another or as complex as predictive analytics and machine learning algorithms altering and enhancing the data.
    Transformations can occur at various stages during the data’s journey, and they significantly impact how the data is perceived and utilized. These transformations and processes are key components of data lineage, as they shape the data, turning it into an asset for decision-making and insights.
  • Data Destination: The final stop on the data journey is its destination—where it is consumed, stored, or used for decision-making. Data destinations can be databases where data is stored for future reference, visualization tools where it is used for creating reports, or other systems where it is utilized for various purposes. Data destinations complete the cycle of data lineage, providing a clear picture of where the data ends up and how it contributes to organizational goals.

A data lineage map is a visual representation of the flow of data through an organization’s systems. It shows how data is extracted from source systems, transformed, and loaded into target systems. Data lineage maps can be used to track the movement of data through complex data processing pipelines, identify dependencies between data assets, and troubleshoot data quality issues.
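
To make the idea concrete, here is a minimal, purely hypothetical example of a single load step whose lineage such a map would capture. The table and column names are invented for illustration; the comments mark the source, the transformation, and the destination.

-- Hypothetical ETL step; raw_orders and sales_summary are illustrative names only
INSERT INTO sales_summary (order_month, total_revenue)      -- destination
SELECT date_trunc('month', order_date) AS order_month,      -- transformation
       sum(order_amount)               AS total_revenue
FROM raw_orders                                             -- source
GROUP BY date_trunc('month', order_date);

A lineage map of this step would show raw_orders feeding sales_summary, with the date truncation and aggregation recorded as the transformation between them.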

Here is a small example of a data lineage map broadening the context with business term lineage:

Image created with Collibra

The picture above shows a complete data lineage. The user is interested in very specific data: human resources. Such columns will have a proper business term, something like "salary" or "promotions", assigned. With Collibra, you can see the lineage for that glossary term and find the best-quality assets in every system. In this case, the map shows the lineage for retention policy data.

Types of Data Lineage

Data lineage manifests in various forms, each offering a distinct perspective on how data evolves and is utilized within an organization. Understanding these types of data lineage is essential for organizations aiming to glean comprehensive insights into their data landscape. The primary types of data lineage are covered in this section:

Forward Data Lineage

Forward data lineage, also known as "downstream" or "outbound" data lineage, traces the path of data from its origins onward. It follows the data as it progresses through the different stages of processing, transformation, and storage, until it reaches the point where it is consumed. It provides insight into the data's journey, including the stages of the data pipeline where the data is processed and transformed, the specific transformations applied to it, the locations where it is stored, and the downstream applications and analytics that consume it.

This type of lineage is essential for understanding how data is created, transformed, and utilized within an organization. If a company collects customer data, forward data lineage would trace it from the moment it is gathered, through the various transformations, until it is used to generate reports or analytics, providing insight into data quality and integrity. Forward data lineage can trace a specific piece of data, or it can start from all of the source endpoints and move forward; which approach is appropriate depends on the organization's needs and the use case for the lineage.

Image created in: Collibra

This scenario shows how the “Promotions” column in the “employees_view” table is derived from data in the “HR Glossary” and “employees” tables. This is an example of forward data lineage.

Backward Data Lineage

Contrary to forward data lineage, backward data lineage, also termed "upstream" or "inbound" data lineage, follows the data in the reverse direction. It starts from where the data is consumed or used and traces its path backward to its source. It is instrumental in understanding the impact of data on downstream processes and in identifying the origin and quality of the data. For instance, if an organization discovers discrepancies in the results of a specific analysis, backward data lineage can help pinpoint the source of the issue by retracing the data's journey and revealing the transformations or discrepancies that occurred along the way. It can be used to identify the root cause of data quality issues, track data lineage for compliance purposes, support data security and risk management, and enable lineage-driven optimization of data pipelines.

Image Source: Collibra

In the scenario above, the user can use backward data lineage to understand how the data in the “Promotions” column was derived and what other data sources it depends on. This information can be used to understand the data quality and to identify potential risks.

Mixed Data Lineage

As the name suggests, mixed data lineage combines elements of both forward and backward data lineage. It provides a comprehensive view of how data flows in both directions, from its source to consumption and vice versa. Mixed data lineage fully explains data flow and its interplay between creation, transformation, storage, and utilization. In a complex data ecosystem, mixed data lineage is vital for data governance, as it ensures that all aspects of data flow, whether upstream or downstream, are well documented and understood.

Impact Analysis Data Lineage

The impact analysis data lineage delves into the effects and influence of changes to data or processes. It helps organizations understand how a modification to a data source, transformation, or storage system can impact downstream processes, applications, or analytics. Impact analysis data lineage can be used to assess the risk of proposed changes to the data pipeline, develop mitigation strategies for potential impact, and make informed decisions about data pipeline changes. If a company plans to upgrade its database system, impact analysis data lineage can help predict how this change will affect various departments, ensuring a smoother transition. Impact analysis data lineage is particularly important for organizations that are heavily reliant on data-driven applications and analytics.

Business Data Lineage

The business data lineage focuses on the business context and relevance of the data. It maps the data flow in a way that aligns with business processes, enabling stakeholders to comprehend how data supports business objectives, and it helps bridge the gap between the technical aspects of data lineage and business requirements, fostering a more comprehensive understanding of data utility. For instance, in a retail company, business data lineage could illustrate how customer purchase data feeds into inventory management and informs restocking decisions. It can be used to identify the key data assets that drive business value, understand the data flows that support key business processes, assess the impact of data quality issues on business outcomes, and make informed decisions about data investments.

Operational Data Lineage

Operational data lineage emphasizes the technical aspects of data flow, encompassing the detailed processing steps and transformations involved. It is essential for data engineers, data architects, and technical teams to optimize data pipelines for performance and reliability, troubleshoot data quality issues, ensure data quality compliance, and measure the impact of data pipelines. It provides a view of data flow that is essential for maintaining the efficiency and effectiveness of data pipelines.

Each type of data lineage provides insights into the data’s journey and usage, catering to specific needs within an organization. Organizations can thoroughly understand their data ecosystem by leveraging a combination of these types based on their objectives and requirements. This understanding, in turn, facilitates better decision-making, improved data management, and the effective utilization of data to drive business growth and innovation.

Importance of Data Lineage in Modern Data Management

Data lineage serves as a guiding beacon, illuminating the complex paths traversed by data from its inception to its utilization. Here is why understanding data lineage is important for modern data management:

  • Enhanced Data Quality and Reliability: Understanding data lineage allows organizations to carefully examine and validate the quality and reliability of the data. By tracing the transformations and processes the data undergoes, any anomalies, errors, or inconsistencies can be identified and rectified, ensuring data integrity and reliability.
  • Facilitates Regulatory Compliance: Compliance with regulatory frameworks and data privacy laws is a concern for modern organizations. Data lineage provides a robust mechanism for demonstrating compliance by offering a clear and auditable trail of how data is sourced, manipulated, stored, and consumed. This transparency is necessary in regulatory audits and ensuring adherence to legal requirements.
  • Effective Data Governance and Management: Data governance is a fundamental aspect of modern data management. Data lineage supports effective data governance by providing a comprehensive understanding of data flows and dependencies. This knowledge enables organizations to establish data governance policies, ensuring data quality, security, privacy, and compliance with organizational guidelines.
  • Efficient Troubleshooting and Issue Resolution: In the data ecosystem, issues are inevitable. Data lineage aids in efficient troubleshooting by enabling quick identification and isolation of problems in the data pipeline. Whether it’s an error in a transformation or an inconsistency in the data source, understanding the lineage helps in swiftly resolving issues, minimizing downtimes, and enhancing operational efficiency.
  • Optimized Data Transformation Processes: Understanding the transformations data undergoes is essential for optimizing data transformation processes. Data lineage provides insights into how data is transformed, which transformations are important, and where efficiency improvements can be made. This optimization is vital for streamlining data processing, reducing processing times, and enhancing resource utilization.
  • Facilitates Data Impact Analysis: Data lineage facilitates impact analysis, allowing organizations to predict the effects of changes to data sources, transformations, or destinations. This foresight is invaluable in risk assessment and change management, ensuring that modifications to the data pipeline are thoroughly evaluated before implementation, minimizing potential disruptions.
  • Supports Data Lineage Governance: Implementing data lineage governance ensures that the data lineage remains accurate, consistent, and up-to-date. It involves establishing processes and roles responsible for maintaining and validating the lineage. Effective governance ensures that the data lineage remains a reliable and trustworthy source of truth for stakeholders.

Ways to Establish Data Lineage

Establishing an effective data lineage is an important initiative for organizations aiming to gain comprehensive insights into their data flow, usage, and transformation processes. While it may seem daunting, there are several strategic approaches and methods that organizations can adopt to establish a robust data lineage. Here are some key strategies to achieve this:

Documentation of Data Sources and Processes

Begin by documenting all data sources and processes within the organization. This involves identifying the systems, applications, and devices that generate or capture data and the various processes through which data passes during its lifecycle. Comprehensive documentation serves as the foundation for creating data lineage.

Collaborative Workshops and Interviews

Conduct collaborative workshops and interviews involving relevant stakeholders, including data engineers, data analysts, business users, and subject matter experts. Engage these stakeholders to gather insights into the data flow, transformations, and consumption patterns. Collaborative discussions often unveil data pathways and dependencies that might not be evident initially.

Utilizing Automated Data Lineage Tools

Leverage automated data lineage tools and platforms designed to capture and maintain data lineage automatically. These tools analyze metadata, logs, and system interactions to map out the data flow, providing a real-time view of how data moves through the organization. Automation accelerates the process and reduces the likelihood of errors associated with manual documentation.

Implementing Metadata Management Systems

Invest in metadata management systems that catalog metadata associated with various data assets, including data sources, transformations, and consumption points. Metadata management facilitates the tracking and organization of data lineage, ensuring that accurate and up-to-date metadata is available for lineage construction.

Data Lineage Governance and Policies

Establish governance policies and processes specific to data lineage. Define roles and responsibilities for managing, updating, and validating data lineage. Implement checks and balances to ensure that data lineage remains consistent, accurate, and aligned with organizational data management objectives.

Continuous Monitoring and Updates

Data lineage is not a one-time effort but an ongoing process. Establish mechanisms for continuous monitoring and updates to reflect any changes in data sources, transformations, or consumption patterns. Regularly review and validate the accuracy and relevance of the documented lineage to keep it up-to-date and reflective of the current data landscape.

Standardized Data Naming Conventions and Metadata Tags

Enforce standardized data naming conventions and metadata tagging. Consistent naming conventions and metadata tags simplify the mapping of data lineage, making it easier to track and trace data as it moves through different stages and processes.

Training and Education Programs

Conduct training and education programs for employees to enhance their understanding of the importance of data lineage. Train individuals involved in data management on how to capture, interpret, and utilize data lineage effectively. An informed workforce is essential for the successful establishment and maintenance of data lineage.

Regular Data Lineage Audits and Validation

Conduct periodic data lineage audits to validate the accuracy and completeness of the documented lineage. Audits help identify discrepancies, errors, or missing links in the lineage, allowing for timely corrections and improvements. The insights gained from audits also contribute to refining the data lineage documentation process.

Integration with Data Lineage Visualization Tools

Utilize data lineage visualization tools to create graphical representations of the data flow. Visualization aids in understanding complex data relationships and dependencies. Integrating these visualizations with data lineage documentation enhances comprehension and usage of the lineage information.

By implementing these strategies and maintaining a diligent approach, organizations can construct an accurate, comprehensive, and dynamic data lineage that forms the cornerstone of effective data management.

Capturing and Visualizing Data Lineage

Capturing and visualizing data lineage is also an important aspect of modern data management, empowering organizations to understand, monitor, and optimize their data flows.

Techniques for Capturing Data Lineage:

Here, we look at different techniques that facilitate the accurate capture of data lineage:

Automated Metadata Harvesting

Automated metadata harvesting is a pivotal technique in data lineage capture. This technique focuses on continuously and automatically extracting metadata from various data sources and processes. Organizations deploy sophisticated software tools designed to traverse through their vast data ecosystems, which can include databases, data lakes, and ETL (Extract, Transform, Load) processes. These tools, often referred to as metadata crawlers, automatically extract a wealth of information from these sources. This information typically includes details about data sources, schema structures, data transformations, and storage mechanisms. By compiling this metadata, organizations gain a comprehensive view of their data landscape. This technique not only simplifies lineage tracking but also aids in data governance, compliance, and quality assurance. The automation aspect is in the regular and systematic harvesting of metadata without being triggered by specific events. It provides a comprehensive, continuous view of the data landscape, making it easier to track lineage and manage data.

Metadata Ingestion and Parsing

When it comes to metadata ingestion, extracting data attributes from various sources is just the beginning. Organizations employ metadata parsers or custom scripts to transform this raw metadata into a standardized format. For instance, details about data source origins, transformation steps, and storage locations are parsed into a consistent structure. This standardized metadata format is important for effective lineage tracking and analysis. It ensures that lineage information remains uniform and can be easily correlated across different data elements, even if they originate from disparate systems or formats.

Event-Based Tracking

In event-based tracking, data lineage is captured through monitoring events, actions, or triggers within data processing systems. Whenever a significant event occurs, such as data transformation or storage, metadata is captured to track the lineage. Real-time monitoring allows organizations to maintain an up-to-date and accurate lineage map. Some events that can be tracked include data extraction from a source system, data loading into a target system, data transformation, such as cleansing, aggregating, or joining, data movement between systems, and data deletion or archival. Event-based tracking can also be automated in the sense that it captures lineage information when specific events occur. However, the key difference between it and automated metadata harvesting is that it’s event triggered. Metadata is captured in response to significant events or actions in data processing, such as data transformations, data loading, or data movement. This approach focuses on capturing lineage data at precise moments rather than continuously scanning the entire ecosystem. Event-based tracking ensures that lineage information is recorded as data flows through the system.
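
As a rough sketch only, the following shows one way event-based capture could look in a PostgreSQL system: a statement-level trigger records a lineage event whenever rows are loaded into a target table. Every table, function, and trigger name here is hypothetical, and real lineage tools would typically capture such events from logs and system metadata rather than from hand-written triggers.

-- Hypothetical event-based lineage capture in PostgreSQL
CREATE TABLE warehouse_orders (order_id int, order_amount numeric);

CREATE TABLE lineage_event (
    event_time  timestamptz NOT NULL DEFAULT now(),
    source_name text,
    target_name text,
    operation   text
);

CREATE OR REPLACE FUNCTION log_load_event() RETURNS trigger AS $$
BEGIN
    -- record that data just arrived in the target table
    INSERT INTO lineage_event (source_name, target_name, operation)
    VALUES ('staging_orders', 'warehouse_orders', TG_OP);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER warehouse_orders_lineage
    AFTER INSERT ON warehouse_orders
    FOR EACH STATEMENT
    EXECUTE FUNCTION log_load_event();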

Data Lineage by Design

Data lineage by design is a technique for capturing data lineage as an integral part of the data architecture and integration processes. During the development phase, architects and developers consciously design systems to generate metadata and lineage information, ensuring seamless and automatic lineage capture during data processing. One example of data lineage by design is to use a data catalog to store and manage metadata about data assets. The data catalog can be configured to track the lineage of data as it flows through the system. Another example is to use a data pipeline management tool to automate the data processing process. The data pipeline management tool can be configured to capture lineage information at each stage of the pipeline.

Why is it Difficult to Capture Data Lineage Later?

Capturing data lineage later can be difficult because it requires organizations to track down and document all the different systems and processes that data flows through. This can be a complex and time-consuming task, especially for large and complex organizations. In addition, data lineage information may be scattered across different systems and tools. This can make it difficult to get a complete view of the data lineage. Finally, organizations may not have the necessary expertise to capture and manage data lineage. This can lead to incomplete or inaccurate lineage information.

Tools for Capturing Data Lineage

Several tools and techniques are available to facilitate the capture and visualization of data lineage effectively. Here are some tools for capturing and visualizing data lineage:

  • Talend Data Fabric is an integration platform that provides capabilities for capturing and visualizing data lineage. It allows users to document metadata, transformations, and lineage in an automated and centralized manner. The tool offers graphical lineage representations, simplifying the understanding of data flow and transformations.

  • Collibra Catalog is a data governance and cataloging tool that includes capabilities for capturing and visualizing data lineage. It allows users to define and capture metadata, relationships, and lineage information. The tool provides a user-friendly interface to visualize the flow of data and understand how it is transformed and utilized across various processes and systems. It is not only an asset for data lineage capture but also a valuable tool for overall data governance.

  • Apache Atlas is a widely used open-source tool for metadata management and governance. It allows organizations to capture and manage metadata, including lineage information. By integrating with various data processing and storage platforms, Apache Atlas offers a centralized repository for metadata, facilitating efficient lineage tracking and visualization and making it easier to trace data origins and understand how data flows through the organization’s infrastructure.
  • Informatica Metadata Manager is a powerful tool that has been around for over 20 years. It assists organizations in capturing and documenting metadata, including data lineage, and it seamlessly integrates with diverse data sources to provide a unified view of metadata, making it easy to trace data transformations and dependencies. The tool can capture and manage metadata from a variety of data sources, including relational databases, data warehouses, data lakes, and big data platforms, and it can visualize lineage information in a variety of ways, including interactive dashboards and reports.

  • IBM’s Information Governance Catalog is a robust tool that enables organizations to capture, manage, and analyze metadata, including data lineage. It integrates with different data sources and applications to collect metadata, providing a comprehensive view of how data moves and transforms within the organization. The tool offers visualization features to represent lineage effectively, including interactive dashboards, reports, and graphs.
  • The Alation Data Catalog is a cloud-native data catalog that facilitates capturing and visualizing data lineage. It automatically captures metadata and lineage information from various data assets, including relational databases, data warehouses, data lakes, and big data platforms, and presents it in a visually intuitive manner. Users can explore data flows, transformations, and dependencies through interactive visualizations, enhancing their understanding of the data lineage.

Techniques for Visualizing Data Lineage:

Here, we explore a few techniques that facilitate the visualization of data lineage:

Graphical Representation

A flow chart of data lineage

Graphical representation is a widely adopted technique for visualizing data lineage. Graphs, flowcharts, and diagrams are utilized to depict the flow of data from source to destination, along with the transformations and processes that data undergoes. Nodes represent data elements, while edges illustrate the connections and transformations.

Hierarchical Tree View

A hierarchical tree view of data lineage

The hierarchical tree view technique organizes data lineage in a tree-like structure, presenting a clear hierarchy of data sources, transformations, and destinations. Users can navigate through the tree, expanding or collapsing nodes to visualize the data flow and understand the lineage in a structured and organized manner.

The specific tools and techniques that you use to capture and visualize data lineage will depend on your specific needs and resources. However, here are some general tips:

  • Start by identifying the data lineage that you want to capture. Do you want to capture the lineage of all your data or just a specific data set or flow?
  • Choose the right tools and techniques for your needs. If you have a complex data environment, you may want to use an automated data lineage tool. If you have a smaller data environment, you may be able to capture data lineage manually using spreadsheets or other documentation tools.
  • Use a variety of tools and techniques. There is no one-size-fits-all approach to capturing and visualizing data lineage. The best approach will vary depending on the specific data lineage that you want to capture and the tools and resources that you have available.
  • Make sure that your data lineage is accurate and up to date. Data lineage is only useful if it is accurate and up to date. Be sure to review and update your data lineage regularly as your data environment changes.

Conclusion

Data lineage is an important component of modern data management. By tracking the flow of data over time, organizations can gain a deeper understanding of their data and how it is used. This information is essential for improving data quality, compliance, security, and business intelligence.

There are several ways to establish and maintain data lineage. The best approach for an organization will depend on its specific needs and resources. However, all organizations should follow certain best practices to maintain an accurate and comprehensive data lineage. Overall, data lineage is an essential tool for any organization that wants to get the most out of its data. By tracking the flow of data, organizations can improve data quality, enhance compliance, increase security, reduce costs, improve efficiency, and increase agility.

 

The post Understanding the Importance of Data Lineage in Modern Data Management appeared first on Simple Talk.




Thursday, December 28, 2023

Eager Aggregation in SQL queries

Aggregation is a widely used way to summarize the content of a database. It is usually expressed with a GROUP BY clause or just with aggregate functions (such as COUNT or SUM). When the database engine executes a query with aggregations, it produces the individual rows needed to compute the required output and then performs the aggregation as the (almost) last step. In this article we discuss how to rewrite a query manually so that the order of operations is different, and when doing so can be beneficial.

We start with a reminder that SQL is a declarative language: a properly written query specifies what should be included in the result but does not specify how to calculate that result. There are several ways (called execution plans) to do that for almost any query. All execution plans for a query produce the same results but may use different amounts of computing resources. An optimizer tries to choose the best plan for execution. Usually, state-of-the-art optimizers do their job well, but sometimes they fail to choose a good plan. This may happen for different reasons:

  • The data statistics and/or cost model are imprecise.
  • The optimizer does not consider some classes of plans.

In this article we discuss one type of query transformation that most optimizers do not apply. Because of this, it can be worthwhile to rewrite a query yourself to help the optimizer order operations in a more efficient way.

An analytical query is supposed to produce some kind of summary generalizing properties of huge amounts of data, but at the same time it should be compact and easy for humans to understand. In terms of the SQL query language, this means that an analytical query extracts and combines a large number of rows and then uses aggregate functions, with or even without a GROUP BY clause. More specifically, we consider queries that contain many JOIN operations followed by aggregation. Queries are usually written in this way, and optimizers typically choose the best order of joins but leave the aggregation as the last step.

Aggregation reduces the number of rows that will eventually be output from the query. Intuitively, we can expect that the resources needed for execution could be reduced if aggregates were computed before joins. Of course, this is not always possible (for example, when the argument of an aggregate function combines columns from joined tables), but even partial aggregation can reduce the cost significantly.

Our example uses the postgres_air database which can be downloaded from (https://github.com/hettie-d/postgres_air). You can download and restore the database if you want to execute any of the code in the article yourself.

The following ER diagram shows a subset of the tables in the postgres_air database that are important for our example. The tables that will be used in the example query are highlighted with the green background.

ER diagram of the postgres_air tables used in the example query

A row in the table booking represents a ticket. Each ticket is for one or more passengers and is linked to one or more flights. The table booking_leg is actually a relationship between the booking and flight tables. We do not need the booking and passenger tables in this example. A boarding pass is issued for each passenger, for each leg in the booking. We use a modified query discussed in [1, chapter 6]. The query returns the number of passengers departing from an airport during a month. To make the results compact, the query returns the 5 airport-months with the largest numbers of passengers:

air=# SELECT
air-#      a.city,
air-#      f.departure_airport,
air-#      to_char(
air(#          date_trunc('month', scheduled_departure),
air(#          'Mon YYYY')  AS month,
air-#      count(passenger_id)  passengers
air-# FROM airport  a
air-# JOIN flight f ON a.airport_code = f.departure_airport
air-# JOIN booking_leg l ON f.flight_id =l.flight_id
air-# JOIN boarding_pass b ON b.booking_leg_id = l.booking_leg_id
air-# GROUP BY city, f.departure_airport,month
air-# ORDER BY passengers DESC
air-# LIMIT 5;

    city     | departure_airport |  month   | passengers 
-------------+-------------------+----------+------------
 CHICAGO     | ORD               | Jul 2023 |     387568
 CHICAGO     | ORD               | Jun 2023 |     352375
 NEW YORK    | JFK               | Jul 2023 |     349624
 LOS ANGELES | LAX               | Jul 2023 |     325389
 NEW YORK    | JFK               | Jun 2023 |     317593
(5 rows)
Time: 23997.359 ms (00:23.997)

The limit of the output to 5 rows does not significantly affect the execution time because the rows are sorted and, therefore, all rows must be produced.

When SQL query performance is discussed, the best place for a DBA to start is with an analysis of the execution plan. So, the execution plans are included below. However, if you are not comfortable reading large execution plans but are willing to accept that the time needed to execute a query depends mostly on the sizes of the tables involved, you can skip the plan and go straight to the compact summary that follows it.

Even without any analysis we can guess that the number of boarding passes is significantly larger than the number of airports. This observation can be confirmed with an execution plan:

Limit  (cost=8513712.70..8513712.73 rows=5 width=49) (actual time=25412.164..25412.781 rows=10 loops=1)
  ->  Sort  (cost=8513712.70..8576950.21 rows=25295004 width=49) (actual time=25412.163..25412.165 rows=10 loops=1)
      Sort Key: (count(*)) DESC
      Sort Method: top-N heapsort  Memory: 26kB
      ->  HashAggregate  (cost=6698394.21..7967096.76 rows=25295004 width=49) (actual time=25411.676..25411.971 rows=2207 loops=1)
          Group Key: a.city, to_char(date_trunc('month'::text, f.scheduled_departure), 'Mon YYYY'::text)
          Planned Partitions: 256  Batches: 1  Memory Usage: 1809kB
          ->  Hash Join  (cost=631461.30..1939771.59 rows=25295004 width=41) (actual time=2507.103..22475.673 rows=25293491 loops=1)
                Hash Cond: (f.departure_airport = a.airport_code)
                ->  Hash Join  (cost=631435.72..1746464.96 rows=25295004 width=12) (actual time=2505.722..12797.428 rows=25293491 loops=1)
                      Hash Cond: (l.flight_id = f.flight_id)
                      ->  Hash Join  (cost=604104.22..1451779.66 rows=25295004 width=4) (actual time=2344.625..8931.789 rows=25293491 loops=1)
                            Hash Cond: (b.booking_leg_id = l.booking_leg_id)
                            ->  Seq Scan on boarding_pass b  (cost=0.00..513758.04 rows=25295004 width=8) (actual time=0.661..1714.724 rows=25293491 loops=1)
                            ->  Hash  (cost=310526.32..310526.32 rows=17894232 width=8) (actual time=2340.328..2340.328 rows=17893566 loops=1)
                                  Buckets: 262144  Batches: 128  Memory Usage: 7512kB
                                  ->  Seq Scan on booking_leg l  (cost=0.00..310526.32 rows=17894232 width=8) (actual time=0.016..1108.611 rows=17893566 loops=1)
                      ->  Hash  (cost=15455.78..15455.78 rows=683178 width=16) (actual time=158.486..158.486 rows=683178 loops=1)
                            Buckets: 262144  Batches: 8  Memory Usage: 6061kB
                            ->  Seq Scan on flight f  (cost=0.00..15455.78 rows=683178 width=16) (actual time=0.049..72.871 rows=683178 loops=1)
                ->  Hash  (cost=16.92..16.92 rows=692 width=13) (actual time=0.380..0.380 rows=692 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 39kB
                     ->  Seq Scan on airport a  (cost=0.00..16.92 rows=692 width=13) (actual time=0.025..0.184 rows=692 loops=1)

 Planning Time: 4.381 ms
 Execution Time: 25414.796 ms

(25 rows)

The following table presents the essential data in a more compact form:

 Table or table expression                          | Number of rows
----------------------------------------------------+----------------
 airport                                            |        692
 flight                                             |     683178
 booking_leg                                        |   17893566
 boarding_pass                                      |   25293491
 All tables joined                                  |   25293491
 After GROUP BY                                     |       2568
 booking_legs with boarding_pass                    |    8841899
 booking_leg join boarding_pass group by flight_id  |     254253

Each boarding pass row is counted exactly once; therefore, the number of rows after all joins is equal to the number of boarding passes. Boarding passes are issued no earlier than 24 hours before the departure, so there are no boarding passes for future bookings. For the same reason, the number of flights in the last row of the table is less than the total number of flights.
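
You can sanity-check this claim directly against the same postgres_air database with a single count, which should match the join output reported in the plan above:

SELECT count(*) FROM boarding_pass;   -- 25293491, the same as the join result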

These observations suggest that we can try to reduce the number of rows after joining the two largest tables, namely, booking_leg and boarding_pass. The modified query looks somewhat more complicated but the execution time is significantly better:

air=# SELECT
air-#      a.city,
air-#      f.departure_airport,
air-#      to_char(
air(#          date_trunc('month', scheduled_departure),
air(#          'Mon YYYY')  AS month,
air-#      sum(cnt.passengers)  passengers
air-# FROM airport  a
air-# JOIN flight f ON airport_code = departure_airport
air-# JOIN (
air(#       SELECT flight_id, count(passenger_id) passengers
air(#       FROM booking_leg l
air(#       JOIN boarding_pass b USING (booking_leg_id)
air(#       GROUP BY flight_id
air(#       ) cnt
air-# USING (flight_id)
air-# GROUP BY city, f.departure_airport,month
air-# ORDER BY passengers DESC
air-# LIMIT 5;


    city     | departure_airport |  month   | passengers 
-------------+-------------------+----------+------------
 CHICAGO     | ORD               | Jul 2023 |     387568
 CHICAGO     | ORD               | Jun 2023 |     352375
 NEW YORK    | JFK               | Jul 2023 |     349624
 LOS ANGELES | LAX               | Jul 2023 |     325389
 NEW YORK    | JFK               | Jun 2023 |     317593

(5 rows)

Time: 11621.977 ms (00:11.622)

Note that the intermediate grouping in this query is partial: it counts passengers on each flight, but the final grouping is still needed. The query returns exactly the same rows as the original one, but the execution time is below 12 seconds, while the original query took almost 24 seconds. This can be confirmed with the execution plan below (again, you can skip it if the difference in the execution time has already convinced you):

Limit  (cost=3148532.58..3148532.61 rows=5 width=73) (actual time=12714.984..12714.987 rows=10 loops=1)
 ->  Sort  (cost=3148532.58..3149070.53 rows=215179 width=73) (actual time=12714.982..12714.984 rows=10 loops=1)
       Sort Key: (sum(cnt.passengers)) DESC
       Sort Method: top-N heapsort  Memory: 26kB
    ->  GroupAggregate  (cost=3137965.22..3143882.64 rows=215179 width=73) (actual time=12666.213..12714.380 rows=2207 loops=1)
           Group Key: a.city, (to_char(date_trunc('month'::text, f.scheduled_departure), 'Mon YYYY'::text))
           ->  Sort  (cost=3137965.22..3138503.17 rows=215179 width=49) (actual time=12664.787..12694.274 rows=254253 loops=1)
               Sort Key: a.city, (to_char(date_trunc('month'::text, f.scheduled_departure), 'Mon YYYY'::text))
               Sort Method: external merge  Disk: 9944kB
                ->  Hash Join  (cost=2901980.71..3111548.55 rows=215179 width=49) (actual time=11307.355..12554.764 rows=254253 loops=1)
                    Hash Cond: (f.departure_airport = a.airport_code)
                      ->  Hash Join  (cost=2901955.14..3109878.78 rows=215179 width=20) (actual time=11306.626..12455.699 rows=254253 loops=1)
                           Hash Cond: (cnt.flight_id = f.flight_id)
                           ->  Subquery Scan on cnt  (cost=2874623.63..3076544.43 rows=215179 width=12) (actual time=11143.457..12204.002 rows=254253 loops=1)
                                 ->  HashAggregate  (cost=2874623.63..3074392.64 rows=215179 width=12) (actual time=11143.456..12193.799 rows=254253 loops=1)
                                        Group Key: l.flight_id
                                        Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 344256kB
                                        ->  Hash Join  (cost=604104.22..1451779.66 rows=25295004 width=4) (actual time=2366.456..8823.030 rows=25293491 loops=1)
                                              Hash Cond: (b.booking_leg_id = l.booking_leg_id)
                                              ->  Seq Scan on boarding_pass b  (cost=0.00..513758.04 rows=25295004 width=8) (actual time=0.590..1697.414 rows=25293491 loops=1)
                                              ->  Hash  (cost=310526.32..310526.32 rows=17894232 width=8) (actual time=2362.140..2362.141 rows=17893566 loops=1)
                                                    Buckets: 262144  Batches: 128  Memory Usage: 7512kB
                                                    ->  Seq Scan on booking_leg l  (cost=0.00..310526.32 rows=17894232 width=8) (actual time=0.015..1113.579 rows=17893566 loops=1)
                            ->  Hash  (cost=15455.78..15455.78 rows=683178 width=16) (actual time=162.236..162.236 rows=683178 loops=1)
                                  Buckets: 262144  Batches: 8  Memory Usage: 6061kB
                                  ->  Seq Scan on flight f  (cost=0.00..15455.78 rows=683178 width=16) (actual time=0.041..74.490 rows=683178 loops=1)
                      ->  Hash  (cost=16.92..16.92 rows=692 width=13) (actual time=0.630..0.630 rows=692 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 39kB
                            ->  Seq Scan on airport a  (cost=0.00..16.92 rows=692 width=13) (actual time=0.024..0.299 rows=692 loops=1)
 Planning Time: 1.999 ms
 Execution Time: 12722.953 ms
(31 rows)

Any experienced DBA will notice that our execution plans do not contain the information describing the input/output needed for query execution (such as the number of buffers). There are two reasons for that:

  • Both plans contain hash joins only, so each row of the input tables is accessed only once in both queries, and the I/O is exactly the same.
  • Our queries are CPU-bound rather than I/O-bound. There is no need to store any intermediate results between operations, and even the sort uses main memory only.

So, is this kind of query transformation beneficial?

The answer is: it depends. Looking at the plans, if you compare the number of rows accessed to produce the result, it is exactly the same as in the original query. But the total costs are quite different:

Original query:

(cost=8513712.70..8513712.73 rows=5 width=49)

Rewritten query:

(cost=3148532.58..3148532.61 rows=5 width=73)

In database folklore, the JOIN operation is considered a major resource consumer. However, aggregation (grouping) has approximately the same complexity as a join. This transformation introduces additional grouping of a large table (or a large intermediate result) in the hope of reducing the complexity of subsequent join operations.

In our example the transformation reduces the number of rows from approximately 25 million to 250 thousand. This reduction dramatically cuts the time needed for subsequent operations.

Transformation Rules

We are now ready to define the transformation rules. A query suitable for the transformation should contain joins followed by an aggregation (either with GROUP BY or just returning one row), and the arguments of the aggregate functions must depend on a subset of the joined tables. The code below is pseudocode; please do not try to execute it. We represent such queries in the following generic form:

SELECT
   ta.gr_attrib_a,
   tb.gr_attrib_b, 
   agg_func(tb.table_b_attrib)
FROM table_a ta 
   JOIN table_b tb 
      ON ta.join_attrib_a = tb.join_attrib_b
GROUP BY ta.gr_attrib_a, tb.gr_attrib_b

The result of the transformation of this generic query is:

SELECT
   ta.gr_attrib_a,
   tb.gr_attrib_b, 
   agg_func1(tb.table_b_attrib)
FROM table_a ta 
   JOIN (
   SELECT 
      join_attrib_b, 
      gr_attrib_b, 
      agg_func2(table_b_attrib) as table_b_attrib
   FROM table_b 
   GROUP BY join_attrib_b, gr_attrib_b 
   ) tb 
   ON ta.join_attrib_a = tb.join_attrib_b
GROUP BY ta.gr_attrib_a, tb.gr_attrib_b

Informally, the transformation introduces additional grouping before the join. The aggregate functions needed after the transformation may differ from the function in the original query (we discuss this later). Both table_a and table_b can be table expressions, and grouping and joining may involve multiple columns.

The table below shows how the example query from the previous section can be derived from the general form.

 Generic form element  | Example query
-----------------------+----------------------------------
 Aggregate function    | COUNT
 table_b_attrib        | passenger_id
 table_a               | flight JOIN airport
 table_b               | booking_leg JOIN boarding_pass
 join_attrib_a         | flight_id
 join_attrib_b         | flight_id
 gr_attrib_a           | city, f.departure_airport, month
 gr_attrib_b           | (empty)

What about aggregate functions?

The example in the previous section uses COUNT as an aggregate function. In this section we provide transformation rules for other SQL aggregate functions.

We have already seen that for COUNT in the original query, agg_func1=sum and agg_func2=count. The functions SUM, MAX, and MIN are easy: agg_func1=agg_func2=agg_func for these functions. Indeed, these functions are commutative and associative, so their arguments can be grouped arbitrarily and in any order. The same applies to the following aggregate functions available in PostgreSQL: bit_and, bit_or, bit_xor, bool_and, and bool_or.

The function AVG is trickier. By definition, the average is the ratio of the sum of the values to the number of those values, so the subquery can compute SUM and COUNT separately, and the final aggregation should divide the sum of the sums by the sum of the counts. However, the aggregate function sum ignores NULL values, while count(*) does not, so the correct expression should use count(column_in_sum). Also, the value of the sum must be converted to a floating-point type before the division.

SELECT
   ta.gr_attrib_a,
   tb.gr_attrib_b, 
   case when sum(attrib_qty) > 0 then
     (sum(attrib_sum)::double precision)/sum(attrib_qty)
     else NULL end as avg_value
FROM table_a ta 
   JOIN (
   SELECT 
      join_attrib_b, 
      gr_attrib_b, 
      sum(table_b_attrib) as attrib_sum,
      count(table_b_attrib) as attrib_qty
   FROM table_b 
   GROUP BY join_attrib_b, gr_attrib_b 
   ) tb 
   ON ta.join_attrib_a = tb.join_attrib_b
GROUP BY ta.gr_attrib_a, tb.gr_attrib_b

An aggregate function that returns a concatenation of its input (such as string_agg, array_agg, or xmlagg) depends on the order of rows and therefore cannot be used together with eager aggregation.

Yet another complication is the keyword DISTINCT, which can precede the argument of any standard aggregate function. Again, the functions max and min are easy because they are idempotent: the value of max or min remains the same no matter how many times a value occurs as an argument.

Functions like count(distinct attr_val) and sum(distinct attr_val) require a preliminary aggregation on attr_val that actually removes the duplicates, as sketched below.
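
The sketch below uses the same generic placeholder names as the earlier pseudocode, so it is illustrative rather than directly executable. The subquery removes duplicates of the aggregated column before the join, shrinking the join input, while the outer query keeps DISTINCT so the result stays correct even when the same value reaches a final group through different join keys.

SELECT
   ta.gr_attrib_a,
   tb.gr_attrib_b,
   count(DISTINCT tb.table_b_attrib)
FROM table_a ta
   JOIN (
   SELECT DISTINCT
      join_attrib_b,
      gr_attrib_b,
      table_b_attrib        -- duplicates removed before the join
   FROM table_b
   ) tb
   ON ta.join_attrib_a = tb.join_attrib_b
GROUP BY ta.gr_attrib_a, tb.gr_attrib_b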

Can an Optimizer Do This?

It looks like our transformation can be defined in a fairly formal way, and it has been known for a very long time (see, for example, [2]). Can a database optimizer do this transformation automatically?

The answer is: yes, it can, but most likely the optimizer of your favorite RDBMS does not do it. In this section we discuss why the developers of your favorite optimizer decided not to implement it.

Most optimizers use some variation of the dynamic programming algorithm known since the late 1970s. This algorithm finds the best order of operations (according to the value of the cost function) for SPJ (select-project-join) queries. Handling of S and P is easy, so the optimization problem is also widely known as the join order selection problem. In other words, such optimizers accept queries containing joins only. What should be done with other operations? Some optimizers just split the query into SPJ parts and optimize them separately. For example, joins before and after the aggregation are optimized separately. We cannot see that from the execution plans because the order is the same here. However, in general our transformation may force early execution of some joins that the optimizer would execute later if all joins were optimized together.

The benefits of our transformation are conditional: the cost of the additional eager aggregation must be less than the gain on subsequent join operations due to the reduced argument size. Evaluating this condition adds computational complexity to the optimizer, which might be undesirable.

The good news is that researchers have already found efficient techniques suitable for these kinds of query transformations, including the more difficult transformations for outer joins and semi-joins [3]. We did not discuss joins other than inner joins in this article.

So, if your optimizer does not do eager aggregation yet, you can try doing it manually. Before doing so, estimate how significant the size reduction will be and choose the right place for the eager aggregation. Most likely, a query that can benefit from eager aggregation contains joins of large and small tables; all the large tables should be joined before the intermediate grouping.
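
For the example in this article, a pair of quick counts against the postgres_air tables is one way to estimate the reduction before committing to the rewrite; the values shown in the comments come from the summary table earlier in the article.

-- rows entering the join pipeline
SELECT count(*)
FROM booking_leg l
JOIN boarding_pass b USING (booking_leg_id);      -- roughly 25 million

-- rows remaining after eager aggregation on flight_id
SELECT count(*)
FROM (
   SELECT flight_id
   FROM booking_leg l
   JOIN boarding_pass b USING (booking_leg_id)
   GROUP BY flight_id
) t;                                              -- roughly 250 thousand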

References

  1. Henrietta Dombrovskaya, Boris Novikov, and Anna Bailliekova: PostgreSQL Query Optimization. Apress, 2021. doi:10.1007/978-1-4842-6885-8
  2. Yan, W., Larson, P.-A.: Eager aggregation and lazy aggregation. In: Proceedings of the International Conference on Very Large Data Bases (VLDB '95), pp. 345–357 (1995)
  3. Eich, M., Fender, P., Moerkotte, G.: Efficient generation of query plans containing group-by, join, and groupjoin. VLDB J. 27(5): 617–641 (2018)

 

The post Eager Aggregation in SQL queries appeared first on Simple Talk.




Tuesday, December 26, 2023

Introducing the MongoDB Document

MongoDB is a document database. As such, the data is stored as individual documents. A document is a data structure made up of one or more field/value pairs. Nearly everything you do in MongoDB is either directly or indirectly related to the documents that you store in a database or move in and out of a database. The better you understand how documents work, the more effectively you can write queries and manage the data.

In my previous article, which was the first in this series, I introduced you to MongoDB and described how to get started with MongoDB Atlas and MongoDB Compass. Atlas provides a cloud-based database service comparable to on-premises MongoDB, and Compass serves as a client interface for connecting to MongoDB and working with document data.

As part of this discussion, I also covered some of the basics of MongoDB documents, demonstrating how to create them and add them to your database. In this article, I expand on this discussion to give you a better sense of how documents are constructed and the different types of data they can contain. To help with this process, I provide several examples that demonstrate some of the ways you can define documents to meet your specific business needs.

Note: For the examples in this article, I used the same MongoDB Atlas and MongoDB Compass environments I set up for the first article. If you want to try out these examples and are uncertain how to connect to Atlas, refer to the first article for more information.

MongoDB document basics

In MongoDB, a document is made up of one or more field/value pairs that are enclosed in a set of curly brackets and separated by commas. A field might contain a single value or multiple values, as in the case of arrays or embedded documents. Each value within a document is defined with a specific data type that determines how MongoDB handles the data.

A MongoDB document is always part of a collection, which in turn is always part of a database. A database can contain multiple collections, and each collection can contain multiple documents.

This structure is similar to a relational database, in which data is organized into tables and rows. A table is comparable to a MongoDB collection, and a row is comparable to a MongoDB document. However, documents in a MongoDB collection do not have to conform to the strict schema restrictions that are imposed on rows in a table, offering far greater flexibility when storing data.

MongoDB stores documents as Binary JSON (BSON), which is a binary representation of JSON documents that extends the number of supported data types. Overall, the document structure itself is fairly straightforward, consisting of one or more field/value pairs, as shown in the following syntax:

{
  field1: value1,
  field2: value2,
  field3: value3,
  ...
  fieldN: valueN
}

Although this structure itself offers a great deal of flexibility, MongoDB still imposes multiple limitations on each document. For example, the maximum size of a document is 16 MB, and there can be no more than 100 nested levels (arrays and embedded documents).

In addition, a document cannot include duplicate field names. If you add a document with duplicate field names, MongoDB will sometimes drop one of the duplicate fields without returning an error or giving you any indication that something is wrong.

Sometimes you might not realize you ran up against a limitation until you try to insert a document into a collection and receive an error. Unfortunately, the error message you receive might not provide you with any useful information, leaving you to do a lot of trial-and-error to pinpoint the problem. In some cases, however, you can find an answer in the MongoDB documentation, starting with the topic https://www.mongodb.com/docs/manual/reference/limits/.

With that in mind, consider the following example, which defines a basic document that contains only three fields:

{
  "name": "Drew",
  "grade": 6,
  "reviews": 9.4 
}

This document includes the following field/value pairs, with each field containing only a single value:

  • The name field has a value of Drew, which is defined with the String data type. This is usually the most common data type used in MongoDB. It is based on the UTF-8 Unicode encoding standard.
  • The grade field has a value of 6, which is defined with the Int32 data type. The data type is used for 32-bit integer values.
  • The reviews field has a value of 9.4, which is defined with the Double data type. This data type is used for floating point numbers, which means they can contain decimal points for greater precision.

That’s all there is to defining a basic document. When you insert a document like this into a collection, MongoDB automatically determines the data type for each field based on its value and whether the value is enclosed in quotation marks. As a general rule, MongoDB interprets quoted values as strings and non-quoted values as numbers, although you can also specify a particular data type when inserting the document, as you’ll see in the next section.
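
If you happen to be working in the MongoDB Shell (mongosh) rather than Compass, the rules are slightly different: the shell treats plain numbers as Double by default, so you request other numeric types explicitly. The following is only a minimal sketch, and it assumes a collection named employees (the collection used later in this article):

// mongosh sketch: insert the basic document, forcing the grade value to Int32
db.employees.insertOne({
  name: "Drew",          // stored as String
  grade: NumberInt(6),   // stored as Int32 (a plain 6 would be stored as Double in the shell)
  reviews: 9.4           // stored as Double
})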

Adding the _id field to a document

Each document in a collection must include an _id field that serves as the document’s primary key. The field’s value is immutable and must be unique within the collection. If you try to add a document to a collection that contains the same _id value as an existing document, MongoDB will return an error, without inserting the document into the collection.

When you add a document to a collection, you can choose to specifically include the _id field or you can let MongoDB add the field for you. If MongoDB adds the field, it will contain a 12-byte hexadecimal value (24 characters), and the value will be defined with the ObjectId data type. For example, the value might look something like 653fcc59b4121d2fe701df04.

If you choose to add the _id field yourself, you can include it in your document definition just like any other field. This approach provides you with more control over your documents. For example, you might be importing documents and want to preserve their original primary keys. The following example shows a document definition that includes the _id field:

{
  "_id": 1001,
  "name": "Drew",
  "grade": 6,
  "reviews": 9.4
}

In this case, the _id value is added as an integer, but you can specify any type of value except an array, as long as the value is unique within the collection. You can even specify a 12-byte hexadecimal value like the ones that MongoDB automatically generates, as in the following document definition:

{
  "_id": "653d35cdcea93f2aea8abbce",
  "name": "Drew",
  "grade": 6,
  "reviews": 9.4
}

In this example, the _id value is defined with the String data type, rather than the ObjectId data type. However, the ObjectId data type offers a couple of advantages over String, in part because an auto-generated ObjectId value incorporates a timestamp that reflects when the value was generated. For example, you can use the ObjectId.getTimestamp() method to extract the creation time of the ObjectId value, and you can sort your documents by their ObjectId values, which is roughly equivalent to sorting by creation times.
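
As a quick illustration, here is a minimal sketch of what that looks like in the MongoDB Shell (mongosh), using the sample ObjectId value shown earlier:

// Extract the creation time encoded in the first four bytes of an ObjectId
const id = ObjectId("653fcc59b4121d2fe701df04")
id.getTimestamp()   // returns an ISODate value derived from the ObjectId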

To save a 12-byte hexadecimal value as an ObjectId value, you must instruct MongoDB to assign that data type. Some MongoDB examples indicate that you should use the ObjectId constructor to specify the ObjectId data type, as in the following example:

{
  "_id": ObjectId("653d35cdcea93f2aea8abbce"),
  "name": "Drew",
  "grade": 6,
  "reviews": 9.4
}

However, if you try to use the ObjectId constructor in Compass when adding a document, the Insert Document dialog box will display the error shown in the following figure.

The issue here is that MongoDB clients vary in their requirements when communicating with a MongoDB database. The Compass interface relies on MongoDB Extended JSON, which refers to the extensions MongoDB adds to the JSON format. Because Compass runs in strict mode, as it relates to Extended JSON, certain statement elements behave differently than they might with other clients. For example, Compass and the MongoDB Shell take different approaches to specifying a field’s data type when inserting a document.

A good example of this is the ObjectId constructor. Because Compass runs in strict mode, you cannot use the ObjectId constructor. Instead, you must use the $oid operator, as shown in the following example:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Avery",
  "grade": 6,
  "reviews": 9.4
}

When using the $oid operator, you must define the _id value as an embedded document that includes one field/value pair, with the $oid operator used for the field name and the hexadecimal string for its value. MongoDB will then define the _id value with the ObjectId data type.

Adding a date field to a document

When using Compass or another client that requires strict adherence to Extended JSON, you’ll likely run into other situations similar to the ObjectId constructor, in which case you’ll need to determine the best way to specify a data type.

For example, your documents might include one or more fields that contain date values. You can, of course, add a date as a string value, as in the following document definition:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Avery",
  "grade": 6,
  "reviews": 9.4,
  "hire_date": "May 13, 2016"
}

However, a string value cannot take advantage of the methods and properties available to the Date data type, such as being able to easily retrieve parts of a date, like year or month. In addition, when you save a date such as May 13, 2016 as a Date object, MongoDB automatically converts it to the UTC datetime format. For example, when I added May 13, 2016 as a Date object on my system, MongoDB saved the value as 2016-05-13T07:00:00.000+00:00.
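
To illustrate the first point, here is a hedged mongosh sketch (it assumes the hr database and employees collection created later in this article) that uses the $year and $month aggregation operators. These operators work on Date values and would fail if hire_date were stored as a string:

// Project the year and month out of each employee's hire_date value
db.employees.aggregate([
  { $project: { name: 1, hire_year: { $year: "$hire_date" }, hire_month: { $month: "$hire_date" } } }
])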

As with the ObjectId constructor, MongoDB provides the new Date constructor for specifying that a value should be defined with the Date data type, as shown in the following example:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Avery",
  "grade": 6,
  "reviews": 9.4,
  "hire_date": new Date("May 13, 2016")
}

Although you’ll see plenty of examples that use this constructor, it won’t work in Compass because you’ll run up against the same issue we ran into with the ObjectId constructor. As a result, you’ll need to recast the field definition, only this time, specifying the $date operator:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Avery",
  "grade": 6,
  "reviews": 9.4,
  "hire_date": { "$date": "May 13, 2016" }
}

Now MongoDB will accept this document with no problem, although you might run into this issue with other data types, depending on your document definitions. Fortunately, Extended JSON also supports other operators for defining data types when you want to control your data type assignments. For more information, check out the MongoDB topic MongoDB Extended JSON (v2).
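
For instance, Extended JSON v2 defines operators such as $numberInt, $numberLong, $numberDouble, and $numberDecimal. The following fragment is only an illustrative sketch (the employee_number and salary fields are hypothetical, not part of the article’s example document) showing how you might pin down numeric types explicitly in Compass rather than relying on type inference:

{
  "grade": { "$numberInt": "6" },
  "reviews": { "$numberDouble": "9.4" },
  "employee_number": { "$numberLong": "900000000001" },
  "salary": { "$numberDecimal": "87500.00" }
}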

Embedding fields in other fields

MongoDB also supports a variety of other data types, in addition to what we’ve already looked at here. Two of the most valuable data types are Array and Object. Both of these let you embed multiple values in a field, making it possible to create additional data layers within your document structure.

The best way to understand the Array and Object data types is to see them in action. The following document builds on the previous example by adding a field whose value is defined with an Object data type and a field whose value is defined with an Array type:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Drew",
  "position": { 
    "department": "R&D", 
    "title": "Senior Developer", 
    "grade": 6,
    "reviews": 9.4 
  },
  "hire_date": { "$date": "May 13, 2016" },
  "current_emp": true,
  "skills": [ "Java", "SQL", "Python", "PHP" ]
}

The position field is the first of the new fields. Its value is an embedded document, which means it will be defined with the Object data type when you add the document to a collection. As such, the value’s elements (field/value pairs) are enclosed in curly braces and separated by commas, just like a top-level document. In this case, the embedded document includes four field/value pairs:

  • The department value will be defined with the String data type.
  • The title value will be defined with the String data type.
  • The grade value will be defined with the Int32 data type.
  • The reviews value will be defined with the Double data type.

Together, these four fields are combined into a single embedded document to form the position value. Notice that you can mix data types as necessary, just as you can in a top-level document that you insert into a collection.

The document in the example above also includes the skills field, which is defined with an Array data type. An Array value is simply a set of values enclosed in square brackets and separated with commas. In this case, the values are Java, SQL, Python, and PHP, all of which are String values. However, an array can also include values that are defined with different data types.

You might have noticed that I added the current_emp field to the document as well. This field’s value is defined with the Boolean data type, which can take a value of either true or false (not enclosed in quotes). I included the field here only to point out another available data type.

In fact, MongoDB supports many data types. As we progress through this series, I hope to introduce some of those types so you have a more complete picture of the various types of data that MongoDB supports. In the meantime, you can find a complete list of data types in the MongoDB topic BSON Types.

Embedding documents in an array

As you saw in the previous section, the Object and Array data types provide you with a great deal of flexibility when creating your documents. But they don’t stop there. MongoDB also lets you embed one type of value within the other type. For example, you can create an Array field whose values are made up of multiple embedded documents, as shown in the following example:

{
  "_id": { "$oid": "653d35cdcea93f2aea8abbce" },
  "name": "Drew",
  "position": { 
    "department": "R&D", 
    "title": "Senior Developer", 
    "grade": 6,
    "reviews": 9.4 
  },
  "hire_date": { "$date": "May 13, 2016" },
  "current_emp": true,
  "education": [
    { "school": "MIT", 
      "degree": "bachelor's",
      "major": "software engineering",
      "gpa": 3.78 },
    { "school": "UC Berkeley", 
      "degree": "master's",
      "major": "computer science",
      "gpa": 3.89 } 
  ],
  "skills": [ "Java", "SQL", "Python", "PHP" ]
}

The document builds on the previous example by adding the education field, which is defined with the Array data type. The array contains two Object values, each one an embedded document that describes some aspect of Drew’s education.

Each document in the education field contains the same three String fields and one Double field. However, you could have defined the documents with different fields or with the same fields and different types, including additional arrays or embedded documents, making it possible to create documents with multiple nested layers. As noted earlier, however, you are limited to 100 nested levels, although this still gives you an incredible amount of flexibility over how you define your documents.

With the addition of the education field, the document now contains many of the elements you’ll likely run up against when working with MongoDB data. Although this document by no means represents everything you can do in MongoDB, it gives you a good sense of a document’s flexibility—as well as its potential complexity. Even so, the document is still only a set of field/value pairs.

To help better understand the document above, the following table provides a breakdown of all the fields and values that it now contains.

Field                | Value                               | Data type | Type alias | Type number
---------------------|-------------------------------------|-----------|------------|------------
_id                  | 653d35cdcea93f2aea8abbce            | ObjectId  | objectid   | 7
name                 | Drew                                | String    | string     | 2
position             | {embedded document with 4 elements} | Object    | object     | 3
position.department  | R&D                                 | String    | string     | 2
position.title       | Senior Developer                    | String    | string     | 2
position.grade       | 6                                   | Int32     | int        | 16
position.reviews     | 9.4                                 | Double    | double     | 1
hire_date            | 2016-05-13T07:00:00.000+00:00       | Date      | date       | 9
current_emp          | true                                | Boolean   | bool       | 8
education            | [array with 2 embedded documents]   | Array     | array      | 4
education[0]         | {embedded document with 4 elements} | Object    | object     | 3
education[0].school  | MIT                                 | String    | string     | 2
education[0].degree  | bachelor's                          | String    | string     | 2
education[0].major   | software engineering                | String    | string     | 2
education[0].gpa     | 3.78                                | Double    | double     | 1
education[1]         | {embedded document with 4 elements} | Object    | object     | 3
education[1].school  | UC Berkeley                         | String    | string     | 2
education[1].degree  | master's                            | String    | string     | 2
education[1].major   | computer science                    | String    | string     | 2
education[1].gpa     | 3.89                                | Double    | double     | 1
skills               | [array with 4 elements]             | Array     | array      | 4
skills[0]            | Java                                | String    | string     | 2
skills[1]            | SQL                                 | String    | string     | 2
skills[2]            | Python                              | String    | string     | 2
skills[3]            | PHP                                 | String    | string     | 2

The table also shows the official name, alias and numeric identifier that MongoDB assigns to each of the document’s data types. You might see the aliases or numeric identifiers used in code examples or in other places, so I thought it would be useful to include them here.

Notice that the array fields shown in the table are also identified by their index numbers. MongoDB automatically assigns these numbers to the array elements, using 0-based indexing to identify each element. For example, the skills array includes four elements, which are numbered 0 through 3. In this way, each element can be easily referenced when querying the document (a topic I’ll be discussing later in the series).
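
As a quick preview, here is a minimal sketch of how such an index can be referenced with dot notation, using the skills array from the example document and the employees collection named in the next section:

// Match documents whose first skills element (index 0) is "Java"
db.employees.find({ "skills.0": "Java" })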

Adding the document to a collection

In the previous article in this series, I explained how to add a document to a collection in Compass. First, you select the collection in the left panel and then, in the main window, click the Add Data drop-down arrow and click Insert document. In the Insert Document dialog box, delete the existing text, type or paste the document code, and then click Insert.

For this article, I created a database named hr and a collection named employees. I then added the document from the previous section (“Embedding documents in an array”) to that collection. The following figure shows the document in List View, with the document fully expanded. To expand a document, hover over the document until the Expand all down-arrow appears near the top left corner (to the left of the _id field) and then click the arrow.

Notice that Compass shows the data type for the Array fields and Object fields. For the Array data type, Compass also displays the number of elements within the array. For example, Compass shows that the education array contains two elements, which are the two embedded documents.

List View does not show the data types for the other fields, although it does show the ObjectId constructor preceding the _id value, indicating that the value is defined with the ObjectId data type.

Note: List View will show all the data types if you double-click one of the document’s elements as though you were going to edit that value, something I’ll be discussing later in the series.

You can also see the data types for all the fields by viewing the document in Table View. When you first switch to Table View, Compass displays the top-level fields and their data types, as shown in the following figure.

You can view the embedded fields and their data types by drilling down into the specific Array or Object value. For example, to view details about the embedded document in the position field, hover over the position value until the edit button appears, and then click that button. Compass will display the document’s four fields, their values, and their data types, as shown in the following figure.

After you finish viewing information about the embedded fields, you can move back to the document’s top level by clicking the employees portion of the breadcrumb near the top left corner of the grid.

If you want to drill into the education field, you can take the same approach as with the position field. Hover over the field’s value and then click the edit button. This will move you down one level, which is shown in the following figure.

Of course, this still doesn’t show you the actual fields because you need to drill down into the individual document. For example, if you hover over the education[0] value in the grid—{} 4 fields—and click the edit button, you’ll be able to view the embedded fields, as shown in the following figure.

Once again, you’re able to view each field name, value, and data type. You can take the same approach with the skills array, which is shown in the next figure. As before, this view provides you with more details about the array’s values.

When working with MongoDB documents in Compass, it helps to know how to find your way around a document so you understand how it is structured and which data types have been assigned to its field values. That knowledge makes it much easier to build queries that return exactly the data you’re looking for.

Getting started with MongoDB documents

Just about everything you do in MongoDB revolves around the stored documents. At its highest level, the document structure is quite basic. Each document is made up of one or more field/value pairs, separated by commas, with everything enclosed in curly brackets. However, a document can quickly become a complex structure when you start adding arrays and embedded documents, especially if you embed even more arrays and documents in the embedded fields. As this series progresses, we’ll be spending a lot of time on how to query these documents, but know that everything you do starts with having a strong foundation in the document structure.

 

The post Introducing the MongoDB Document appeared first on Simple Talk.



from Simple Talk https://ift.tt/MZcQ8Jx

Thursday, December 21, 2023

Counting more efficiently

Nearly a decade ago, I wrote a post called “Bad habits : Counting rows the hard way.” In that post, I talked about how we can use SQL Server’s metadata to instantly retrieve the row count for a table. Typically, people do the following, which has to read the entire table or index:

DECLARE @c int = (SELECT COUNT(*) FROM dbo.TableName);

To largely avoid size-of-data constraints, we can instead use sys.partitions.

Note: We could use OBJECT_ID() instead of a join, but that function doesn’t observe isolation semantics, so it can cause blocking – or be a victim. A potential compromise is to create standardized metadata views, but I’ll leave that as an exercise for the reader.
DECLARE @object int = (SELECT o.object_id 
    FROM sys.objects AS o
      INNER JOIN sys.schemas AS s
        ON o.[schema_id] = s.[schema_id]
      WHERE o.name = N'TableName'
        AND s.name = N'dbo');

 DECLARE @c int = (SELECT SUM([rows]) 
    FROM sys.partitions
      WHERE index_id IN (0,1)
        AND object_id = @object);

That’s great when you want to count the whole table without size-of-entire-table reads. It gets more complicated if you need to retrieve the count of rows that meet – or don’t meet – some criteria. Sometimes an index can help, but not always, depending on how complex the criteria might be.

An example

One example is the soft delete, where we have a bit column like IsActive that defaults to 1, and only gets updated to 0 when a row gets “soft deleted.” Consider a table like this:

CREATE TABLE dbo.Users
 (
   UserID   int        NOT NULL,
   Filler   char(2000) NOT NULL DEFAULT '',
   IsActive bit        NOT NULL DEFAULT 1,
   CONSTRAINT PK_Users PRIMARY KEY (UserID)
 );

 -- insert 100 rows:
 INSERT dbo.Users(UserID) 
   SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY object_id)
   FROM sys.all_columns
   ORDER BY object_id; 

 -- soft delete 3 rows:
 UPDATE dbo.Users SET IsActive = 0 WHERE UserID IN (66, 99, 4);

The metadata alone can’t tell us how many active users we have; only the total. But sometimes we want to know how many are active and how many are inactive. I have seen users implement such a query like this:

SELECT [status] = 'Active', [count] = COUNT(*) 
   FROM dbo.Users WHERE IsActive = 1
 UNION ALL
 SELECT [status] = 'Inactive', [count] = COUNT(*) 
   FROM dbo.Users WHERE IsActive = 0;

Results:

 status     | count |
 -----------|-------|
 Active     |    97 |
 Inactive   |     3 |

Without any other indexes, this is achieved with two full scans:

A union produces two full scans

A slightly more efficient query, that only scans the entire table once, is a conditional aggregation:

SELECT Active   = COUNT(CASE IsActive WHEN 1 THEN 1 END),
        Inactive = COUNT(CASE IsActive WHEN 0 THEN 1 END)
 FROM dbo.Users;

Results:

  Active | Inactive  |
 --------|-----------|
      97 |         3 |

Conditional aggregation: just one scan

Why not create an index?

Conventional wisdom has told us to not create indexes involving bit columns, because they’re not very selective. If you’re using such a column to highlight the minority (e.g. only a tiny fraction of users ever get soft deleted), then an index on just the IsActive column wouldn’t be extremely useful:

CREATE INDEX ix_IsActive ON dbo.Users(IsActive);

The above queries would still have to look at all the rows (the conditional aggregate would still perform a full scan, and the UNION would seek twice, at this low row count, but that could get worse at scale). Both plans:

An index might not be overly helpful

But what if we had a filtered index on just the minority case (where IsActive = 0)?

DROP INDEX ix_IsActive ON dbo.Users;

 CREATE INDEX ix_IsInactive ON dbo.Users(IsActive) WHERE IsActive = 0;

This doesn’t help our existing queries much at all. The UNION can scan this smaller index to get the inactive count, but still has to scan the entire table for the active count. And the conditional aggregate still has to perform a full scan:

Even a filtered index might not be overly helpful

The filtered index brings the two queries closer, but both still have performance tied to the size of the table.

That said…

The filtered index buys us something else

Here’s a secret: with the filtered index, we can now satisfy both counts without ever touching the table at all. Assuming IsActive can’t be NULL, the active user count is simply the total minus the inactive user count. So this query gets the same results without size-of-entire-index reads:

WITH idx AS
  (
    SELECT i.[object_id], i.index_id
    FROM sys.indexes AS i
      INNER JOIN sys.objects AS o
      ON i.[object_id] = o.[object_id]
      INNER JOIN sys.schemas AS s
      ON o.[schema_id] = s.[schema_id]
      WHERE o.name = N'Users'
        AND s.name = N'dbo'
        AND 
        (
          i.index_id IN (0,1)
          OR i.filter_definition = N'([IsActive]=(0))'
        )
  ), agg AS
  (
    SELECT Total    = SUM(CASE WHEN p.index_id IN (0,1) THEN p.[rows] END), 
           Inactive = SUM(CASE WHEN p.index_id > 1      THEN p.[rows] END)
    FROM sys.partitions AS p
    INNER JOIN idx 
      ON p.[object_id] = idx.[object_id]
      AND p.index_id = idx.index_id
  )
  SELECT Active = Total - Inactive, 
         Inactive
  FROM agg;

Now, the query is a lot more complicated than it absolutely has to be, and it produces a plan that is far more elaborate than you might expect (so elaborate I’m not going to show it). But you can simplify. For example, if you know the index_id for the filtered index (in my case, 3) and are comfortable using OBJECT_ID() despite the earlier caveat:

DECLARE @filter tinyint = 3;

 WITH agg AS
 (
   SELECT Total    = SUM(CASE WHEN index_id <> @filter THEN [rows] END), 
          Inactive = SUM(CASE WHEN index_id =  @filter THEN [rows] END)
   FROM sys.partitions
   WHERE [object_id] = OBJECT_ID(N'dbo.Users')
     AND index_id IN (0, 1, @filter)
 )
 SELECT Active = Total - Inactive, 
        Inactive
 FROM agg;

You might not want to memorize the index_id, or hard-code that in production, since indexes can be dropped and re-created in any order. But if you can trust the index name to be stable, and for the filter definition to be accurate, you could replace the first line with the following:

DECLARE @filter tinyint;

 SELECT @filter = index_id 
   FROM sys.indexes
   WHERE name = N'ix_IsInactive'
     AND has_filter = 1
     AND [object_id] = OBJECT_ID(N'dbo.Users');

In either case, you may want error handling to indicate if the index wasn’t found.
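
Here is one minimal sketch of what that error handling might look like, placed immediately after the lookup above (the message text is just an illustration):

IF @filter IS NULL
BEGIN
  -- The filtered index wasn't found, so the counts would be wrong or NULL;
  -- fail fast instead of returning misleading results.
  RAISERROR(N'Filtered index ix_IsInactive was not found on dbo.Users.', 16, 1);
  RETURN;
END;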

Other examples

I used bit columns here to dovetail off an earlier post entitled “Strategies for queries against bit columns.” But this technique isn’t limited to columns with only two values and significant skew; it can be applied in many more scenarios.

For example, I’ve had dashboards that pull subsets of tables to give counts of things such as a related set of error status codes, backed by a filtered index like WHERE error_code IN (1, 12, 18456). Now the dashboard can go straight to the metadata and pull a single count directly.
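
As a rough sketch of that idea (the dbo.EventLog table and its columns are hypothetical, used only for illustration), the index and the metadata-only count might look like this:

-- Filtered index covering only the error codes the dashboard cares about
CREATE INDEX ix_EventLog_DashboardErrors
  ON dbo.EventLog(error_code)
  WHERE error_code IN (1, 12, 18456);

 -- Row count for that subset, straight from metadata
 SELECT DashboardErrorCount = SUM(p.[rows])
   FROM sys.partitions AS p
   INNER JOIN sys.indexes AS i
     ON p.[object_id] = i.[object_id]
    AND p.index_id = i.index_id
   WHERE i.name = N'ix_EventLog_DashboardErrors'
     AND i.[object_id] = OBJECT_ID(N'dbo.EventLog');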

You can do similar things with date ranges, e.g. have a filtered index re-created at the beginning of each month (or any other period), with a hard-coded start range, to simulate partitioning (at least for the current “partition”) without all the work. The metadata for that index will always show the total count of rows for the current month.
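
Here is a hedged sketch of that date-range variation (again, the dbo.Orders table and order_date column are hypothetical), where a scheduled job drops and re-creates the index with the new month’s boundary:

-- Re-created by a scheduled job at the start of each month with a hard-coded boundary
DROP INDEX IF EXISTS ix_Orders_CurrentMonth ON dbo.Orders;

 CREATE INDEX ix_Orders_CurrentMonth
   ON dbo.Orders(order_date)
   WHERE order_date >= '20231201';

 -- The same sys.partitions pattern shown earlier then returns the row count
 -- for the current month without touching dbo.Orders itself.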

And you could combine these as well; for example, a filtered index against some set of error codes this month. Or a filtered index for each error code this month, if they’re important independently, too.

Conclusion

In spite of ugly-looking underlying execution plans against system objects, this technique should be much faster than querying the table – particularly at scale. The only exception might be when the table is small or otherwise always in memory, and sys.partitions is massive (e.g. you have a ridiculous number of tables, partitions, and indexes). Going after the metadata is a technique you may want to consider if you’re performing a lot of queries for counts, filtered or not.

The post Counting more efficiently appeared first on Simple Talk.



from Simple Talk https://ift.tt/xfRtcpS