Grouping optimization

This section describes how DB2 UDB for AS/400 implements grouping techniques and how optimization choices are made by the query optimizer. The query optimizer has two choices for implementing grouping: the hash implementation or the index implementation.

Grouping hash implementation

This technique uses the base hash access method to perform grouping or summarization of the selected table rows. For each selected row, the specified grouping value is run through the hash function. The computed hash value and grouping value are used to quickly find the entry in the hash table corresponding to the grouping value. If the current grouping value already has a row in the hash table, the hash table entry is retrieved and summarized (updated) with the current table row values based on the requested grouping column operations (such as SUM or COUNT). If a hash table entry is not found for the current grouping value, a new entry is inserted into the hash table and initialized with the current grouping value.

The time required to receive the first group result for this implementation will most likely be longer than other grouping implementations because the hash table must be built and populated first. Once the hash table is completely populated, the database manager uses the table to start returning the grouping results. Before returning any results, the database manager must apply any specified grouping selection criteria or ordering to the summary entries in the hash table.

Where the grouping hash method is most effective

The grouping hash method is most effective when the consolidation ratio is high. The consolidation ratio is the ratio of the selected table rows to the computed grouping results. If every database table row has its own unique grouping value, then the hash table will become too large. This in turn will slow down the hashing access method.

The optimizer estimates the consolidation ratio by first determining the number of unique values in the specified grouping columns (that is, the expected number of groups in the database table). The optimizer then examines the total number of rows in the table and the specified selection criteria and uses the result of this examination to estimate the consolidation ratio.

Indexes over the grouping columns can help make the optimizer's ratio estimate more accurate. Indexes improve the accuracy because they contain statistics that include the average number of duplicate values for the key columns.

The optimizer also uses the expected number of groups estimate to compute the number of partitions in the hash table. As mentioned earlier, the hashing access method is more effective when the hash table is well-balanced. The number of hash table partitions directly affects how entries are distributed across the hash table and the uniformity of this distribution.

The hash function performs better when the grouping values consist of columns that have non-numeric data types, with the exception of the integer (binary) data type. In addition, specifying grouping value columns that are not associated with the variable length and null column attributes allows the hash function to perform more effectively.

Index grouping implementation

This implementation utilizes the index scan-key selection or index scan-key positioning access methods to perform the grouping. An index is required that contains all of the grouping columns as contiguous leftmost key columns. The database manager accesses the individual groups through the index and performs the requested summary functions.

Since the index, by definition, already has all of the key values grouped together, the first group result can be returned in less time than the hashing method. This is because of the temporary result that is required for the hashing method. This implementation can be beneficial if an application does not need to retrieve all of the group results or if an index already exists that matches the grouping columns.

When the grouping is implemented with an index and a permanent index does not already exist that satisfies grouping columns, a temporary index is created. The grouping columns specified within the query are used as the key columns for this index.

Optimizing grouping by eliminating grouping columns

All of the grouping columns are evaluated to determine if they can be removed from the list of grouping columns. Only those grouping columns that have isolatable selection predicates with an equal operator specified can be considered. This guarantees that the column can only match a single value and will not help determine a unique group. This processing is done to allow the optimizer to consider more indexes to implement the query and to reduce the number of columns that will be added as key columns to a temporary index or hash table.

The following example illustrates a query where the optimizer could eliminate a grouping column.

    DECLARE DEPTEMP CURSOR FOR
      SELECT EMPNO, LASTNAME, WORKDEPT
        FROM CORPDATA.EMPLOYEE
        WHERE EMPNO = '000190'
        GROUP BY EMPNO, LASTNAME, WORKDEPT
 

OPNQRYF example:

OPNQRYF FILE(EMPLOYEE) FORMAT(FORMAT1)
   QRYSLT('EMPNO *EQ ''000190''')
   GRPFLD(EMPNO LASTNAME WORKDEPT)

In this example, the optimizer can remove EMPNO from the list of grouping columns because of the EMPNO = '000190' selection predicate. An index that only has LASTNAME and WORKDEPT specified as key columns can be considered to implement the query and if a temporary index or hash is required then EMPNO will not be used.
Note:Even though EMPNO can be removed from the list of grouping columns, the optimizer might still choose to use that index if a permanent index exists with all three grouping columns.

Optimizing grouping by adding additional grouping columns

The same logic that is applied to removing grouping columns can also be used to add additional grouping columns to the query. This is only done when you are trying to determine if an index can be used to implement the grouping.

The following example illustrates a query where the optimizer could add an additional grouping column.

    CREATE INDEX X1 ON EMPLOYEE
          (LASTNAME, EMPNO, WORKDEPT)
 
    DECLARE DEPTEMP CURSOR FOR
      SELECT LASTNAME, WORKDEPT
        FROM CORPDATA.EMPLOYEE
        WHERE EMPNO = '000190'
        GROUP BY LASTNAME, WORKDEPT
 
 
 

OPNQRYF example:

OPNQRYF FILE ((EMPLOYEE)) FORMAT(FORMAT1)
   QRYSLT('EMPNO *EQ ''000190''')
   GRPFLD(LASTNAME WORKDEPT)
 

For this query request, the optimizer can add EMPNO as an additional grouping column when considering X1 for the query.

Optimizing grouping by using index skip key processing

Index Skip Key processing can be used when grouping with the keyed sequence implementation algorithm which uses an existing index. The index skip key processing algorithm:

  1. Uses the index to position to a group and
  2. finds the first row matching the selection criteria for the group, and if specified the first non-null MIN or MAX value in the group
  3. Returns the group to the user
  4. "Skip" to the next group and repeat processing

This will improve performance by potentially not processing all index key values for a group.

Index skip key processing can be used:


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]