This section describes how DB2 UDB for AS/400 implements grouping techniques and how optimization choices are made by the query optimizer. The query optimizer has two choices for implementing grouping: the hash implementation or the index implementation.
This technique uses the base hash access method to perform grouping or summarization of the selected table rows. For each selected row, the specified grouping value is run through the hash function. The computed hash value and grouping value are used to quickly find the entry in the hash table corresponding to the grouping value. If the current grouping value already has a row in the hash table, the hash table entry is retrieved and summarized (updated) with the current table row values based on the requested grouping column operations (such as SUM or COUNT). If a hash table entry is not found for the current grouping value, a new entry is inserted into the hash table and initialized with the current grouping value.
The time required to receive the first group result for this implementation will most likely be longer than other grouping implementations because the hash table must be built and populated first. Once the hash table is completely populated, the database manager uses the table to start returning the grouping results. Before returning any results, the database manager must apply any specified grouping selection criteria or ordering to the summary entries in the hash table.
Where the grouping hash method is most effective
The grouping hash method is most effective when the consolidation ratio is high. The consolidation ratio is the ratio of the selected table rows to the computed grouping results. If every database table row has its own unique grouping value, then the hash table will become too large. This in turn will slow down the hashing access method.
The optimizer estimates the consolidation ratio by first determining the number of unique values in the specified grouping columns (that is, the expected number of groups in the database table). The optimizer then examines the total number of rows in the table and the specified selection criteria and uses the result of this examination to estimate the consolidation ratio.
Indexes over the grouping columns can help make the optimizer's ratio estimate more accurate. Indexes improve the accuracy because they contain statistics that include the average number of duplicate values for the key columns.
The optimizer also uses the expected number of groups estimate to compute the number of partitions in the hash table. As mentioned earlier, the hashing access method is more effective when the hash table is well-balanced. The number of hash table partitions directly affects how entries are distributed across the hash table and the uniformity of this distribution.
The hash function performs better when the grouping values consist of columns that have non-numeric data types, with the exception of the integer (binary) data type. In addition, specifying grouping value columns that are not associated with the variable length and null column attributes allows the hash function to perform more effectively.
This implementation utilizes the index scan-key selection or index scan-key positioning access methods to perform the grouping. An index is required that contains all of the grouping columns as contiguous leftmost key columns. The database manager accesses the individual groups through the index and performs the requested summary functions.
Since the index, by definition, already has all of the key values grouped together, the first group result can be returned in less time than the hashing method. This is because of the temporary result that is required for the hashing method. This implementation can be beneficial if an application does not need to retrieve all of the group results or if an index already exists that matches the grouping columns.
When the grouping is implemented with an index and a permanent index does not already exist that satisfies grouping columns, a temporary index is created. The grouping columns specified within the query are used as the key columns for this index.
All of the grouping columns are evaluated to determine if they can be removed from the list of grouping columns. Only those grouping columns that have isolatable selection predicates with an equal operator specified can be considered. This guarantees that the column can only match a single value and will not help determine a unique group. This processing is done to allow the optimizer to consider more indexes to implement the query and to reduce the number of columns that will be added as key columns to a temporary index or hash table.
The following example illustrates a query where the optimizer could eliminate a grouping column.
DECLARE DEPTEMP CURSOR FOR
SELECT EMPNO, LASTNAME, WORKDEPT
FROM CORPDATA.EMPLOYEE
WHERE EMPNO = '000190'
GROUP BY EMPNO, LASTNAME, WORKDEPT
OPNQRYF example:
OPNQRYF FILE(EMPLOYEE) FORMAT(FORMAT1)
QRYSLT('EMPNO *EQ ''000190''')
GRPFLD(EMPNO LASTNAME WORKDEPT)
In this example, the optimizer can remove EMPNO from the list of grouping columns because of the EMPNO = '000190' selection predicate. An index that only has LASTNAME and WORKDEPT specified as key columns can be considered to implement the query and if a temporary index or hash is required then EMPNO will not be used.
| Note: | Even though EMPNO can be removed from the list of grouping columns, the optimizer might still choose to use that index if a permanent index exists with all three grouping columns. |
The same logic that is applied to removing grouping columns can also be used to add additional grouping columns to the query. This is only done when you are trying to determine if an index can be used to implement the grouping.
The following example illustrates a query where the optimizer could add an additional grouping column.
CREATE INDEX X1 ON EMPLOYEE
(LASTNAME, EMPNO, WORKDEPT)
DECLARE DEPTEMP CURSOR FOR
SELECT LASTNAME, WORKDEPT
FROM CORPDATA.EMPLOYEE
WHERE EMPNO = '000190'
GROUP BY LASTNAME, WORKDEPT
OPNQRYF example:
OPNQRYF FILE ((EMPLOYEE)) FORMAT(FORMAT1)
QRYSLT('EMPNO *EQ ''000190''')
GRPFLD(LASTNAME WORKDEPT)
For this query request, the optimizer can add EMPNO as an additional grouping column when considering X1 for the query.
Index Skip Key processing can be used when grouping with the keyed sequence implementation algorithm which uses an existing index. The index skip key processing algorithm:
This will improve performance by potentially not processing all index key values for a group.
Index skip key processing can be used:
Example 1, using SQL:
CREATE INDEX IX1 ON EMPLOYEE (SALARY DESC) DECLARE C1 CURSOR FOR SELECT MAX(SALARY) FROM EMPLOYEE;
Example 1, using the OPNQRYF command:
OPNQRYF FILE(EMPLOYEE) FORMAT(FORMAT1) MAPFLD((MAXSAL '%MAX(SALARY)'))
The query optimizer will chose to use the index IX1. The SLIC runtime code will scan the index until it finds the first non-null value for SALARY. Assuming that SALARY is not null, the runtime code will position to the first index key and return that key value as the MAX of salary. No more index keys will be processed.
Example 2, using SQL:
CREATE INDEX IX2 ON EMPLOYEE (DEPT, JOB,SALARY) DECLARE C1 CURSOR FOR SELECT DEPT, MIN(SALARY) FROM EMPLOYEE WHERE JOB='CLERK' GROUP BY DEPT
Example 2, using the OPNQRYF command:
OPNQRYF FILE(EMPLOYEE) FORMAT(FORMAT2)
QRYSLT('JOB *EQ ''CLERK''')
GRPFLD((DEPT))
MAPFLD((MINSAL '%MIN(SALARY)'))
The query optimizer will chose to use Index IX2. The SLIC runtime code will position to the first group for DEPT where JOB equals 'CLERK' and will return the SALARY. The code will then skip to the next DEPT group where JOB equals 'CLERK'.
Example 1, using SQL:
CREATE INDEX IX1 ON DEPARTMENT(DEPTNAME)
CREATE INDEX IX2 ON EMPLOYEE(WORKDEPT, SALARY)
DECLARE C1 CURSOR FOR
SELECT DEPTNAME, MIN(SALARY)
FROM DEPARTMENT, EMPLOYEE
WHERE DEPARTMENT.DEPTNO=EMPLOYEE.WORKDEPT
GROUP BY DEPARTMENT.DEPTNO;
Example 1, using the OPNQRYF command:
OPNQRYF FILE(DEPARTMENT EMPLOYEE) FORMAT(FORMAT1) JFLD((1/DEPTNO 2/WORKDEPT *EQ)) GRPFLD((1/DEPTNO)) MAPFLD((MINSAL '%MIN(SALARY)'))