Lecture Week6
Grouping data.
Aggregate Functions in SQL.
Grouping is used to create subgroups of tuples before summarization. Grouping and aggregation
are required in many database applications, and we will introduce their use in SQL through
examples.
Aggregate functions are used to summarize information from multiple tuples into a single-tuple
summary. In other words, aggregate functions perform a calculation on a set of rows and return a
single row.
We often use aggregate functions with the GROUP BY clause of the SELECT statement. In
these cases, the GROUP BY clause divides the result set into groups of rows, and the aggregate
functions perform a calculation (for example, maximum, minimum, or average) on each group.
These functions can be used in the SELECT clause or in a HAVING clause (which we introduce
later).
The functions MAX and MIN can also be used with attributes that have nonnumeric domains if
the domain values have a total ordering among one another (for example, with dates).
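As a minimal sketch of MIN and MAX on nonnumeric columns, the snippet below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The column names Lname and Bdate and all rows are assumptions made up for illustration; dates are stored as ISO-8601 text, so their string order matches chronological order.

```python
import sqlite3

# Hypothetical EMPLOYEE rows (not from the lecture database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Lname TEXT, Bdate TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("Smith", "1965-01-09"), ("Wong", "1955-12-08"), ("Zelaya", "1968-01-19")],
)

# MIN/MAX work on any column whose values have a total ordering.
earliest, latest, first_name, last_name = conn.execute(
    "SELECT MIN(Bdate), MAX(Bdate), MIN(Lname), MAX(Lname) FROM EMPLOYEE"
).fetchone()
print(earliest, latest)       # 1955-12-08 1968-01-19
print(first_name, last_name)  # Smith Zelaya
```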
COMPANY DB structure
Example 1.
Find the sum of the salaries of all employees, the maximum salary, the minimum salary, and the
average salary:
SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)
FROM EMPLOYEE;
If we want to get the preceding function values for employees of a specific department, say the
‘Research’ department, we can write a query in which the EMPLOYEE tuples are restricted by the
WHERE clause to those employees who work for the ‘Research’ department.
Example 2.
Find the sum of the salaries of all employees of the ‘Research’ department, as well as the
maximum salary, the minimum salary, and the average salary in this department:
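SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)
FROM (EMPLOYEE JOIN DEPARTMENT ON Dno=Dnumber)
WHERE Dname='Research';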
Example 3.
Retrieve the total number of employees in the company:
SELECT COUNT (*)
FROM EMPLOYEE;
Here the asterisk (*) refers to the rows (tuples), so COUNT (*) returns the number of rows in the
result of the query. We may also use the COUNT function to count values in a column rather
than tuples, as in the next example.
Example 4.
Count the number of distinct salary values in the database:
SELECT COUNT (DISTINCT Salary)
FROM EMPLOYEE;
The following statements illustrate various ways of using the COUNT() function.
COUNT(*)
The COUNT(*) function returns the number of rows returned by a SELECT statement, including
NULL and duplicates:
SELECT
COUNT(*)
FROM
table_name
WHERE
condition;
When you apply the COUNT(*) function to an entire table, PostgreSQL has to scan the whole
table sequentially, so the query can be slow on a big table.
COUNT(column)
Similar to the COUNT(*) function, the COUNT(column) function counts the rows returned
by a SELECT statement. However, it does not count rows in which the column is NULL:
SELECT
COUNT(column)
FROM
table_name
WHERE
condition;
COUNT(DISTINCT column)
In this form, the COUNT(DISTINCT column) returns the number of unique non-null values in
the column.
SELECT
COUNT(DISTINCT column)
FROM
table_name
WHERE
condition;
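The three forms of COUNT() can be compared side by side. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the payment rows are made up for illustration and include one NULL and one duplicate amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    # Hypothetical rows: 2.99 appears twice, one amount is NULL.
    [(1, 2.99), (1, 2.99), (2, 4.99), (2, None), (3, 0.99)],
)

count_all, count_amount, count_distinct = conn.execute(
    "SELECT COUNT(*), COUNT(amount), COUNT(DISTINCT amount) FROM payment"
).fetchone()

# COUNT(*) counts every row, COUNT(amount) skips the NULL,
# COUNT(DISTINCT amount) also collapses the duplicate 2.99.
print(count_all, count_amount, count_distinct)  # 5 4 3
```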
We often use the COUNT() function with the GROUP BY clause to return the number of items
for each group. For example, we can use the COUNT() with the GROUP BY clause to return the
number of films in each film category.
The following statement uses the COUNT(*) function to return the number of transactions in
the payment table:
SELECT
COUNT(*)
FROM
payment;
To get the number of distinct amounts that customers paid, you use the COUNT(DISTINCT amount)
function as shown in the following example:
SELECT
COUNT(DISTINCT amount)
FROM
payment;
To get the number of payments made by each customer, you use the GROUP BY clause to group the
payments by customer id, and the COUNT() function to count the payments
in each group.
The following query illustrates the idea:
SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id;
If you want to know the average amount that customers paid, you can apply the AVG function to
the amount column, as in the following query:
SELECT AVG(amount)
FROM payment;
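To make the result easier to read, you can round it to a whole number:
SELECT ROUND(AVG(amount))
FROM payment;
Or cast it to a numeric type with two decimal places: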
SELECT AVG(amount)::numeric(8,2)
FROM payment;
The following query uses the MAX() function to find the highest amount paid by customers in
the payment table:
SELECT MAX(amount)
FROM payment;
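To get information about the customer with the highest payment, you use a subquery: the inner
query finds the maximum amount with MAX(), and the outer query selects the rows of the payment
table whose amount equals that value.
PostgreSQL evaluates the subquery first and then uses its result in the WHERE clause of the
outer query.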
To find the minimum value in a column of a table, you pass the name of the column to
the MIN() function. The data type of the column can be number, string, or any comparable type.
The following example uses the MIN() function to get the lowest rental rate from
the rental_rate column of the film table:
SELECT
MIN (rental_rate)
FROM
film;
To get films which have the lowest rental rate, you use the following query:
SELECT
film_id,
title,
rental_rate
FROM
film
WHERE
rental_rate = (
SELECT MIN(rental_rate)
FROM film
);
The SUM() function ignores NULL values; they are not included in the calculation.
If you use the DISTINCT option, the SUM() function calculates the sum of distinct values only.
For example, without the DISTINCT option, the SUM() of 1, 1, 8, and 2 returns 12. With
the DISTINCT option, the SUM() of 1, 1, 8, and 2 returns 11 (1 + 8 + 2), because the duplicate
value (1) is ignored.
If the SELECT statement returns no rows, the SUM() function returns NULL, not zero.
For example, a statement that uses the SUM() function to calculate the total payment of customer
id 2000 returns NULL, not zero, if that customer has no rows in the payment table.
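The SUM() rules described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the payment rows are made up: they hold the values 1, 1, 8, 2 and one NULL, with no row for customer id 2000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    # Hypothetical rows: the values 1, 1, 8, 2 plus one NULL amount.
    [(1, 1), (1, 1), (1, 8), (1, 2), (1, None)],
)

# The NULL is ignored; DISTINCT also drops the duplicate 1.
total, distinct_total = conn.execute(
    "SELECT SUM(amount), SUM(DISTINCT amount) FROM payment"
).fetchone()

# No row has customer_id = 2000, so SUM() has nothing to add up.
(empty_total,) = conn.execute(
    "SELECT SUM(amount) FROM payment WHERE customer_id = 2000"
).fetchone()

print(total, distinct_total, empty_total)  # 12 11 None
```

If a zero is preferred over NULL for the empty case, wrapping the call as COALESCE(SUM(amount), 0) is a common way to get it.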
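In many cases we want to apply aggregate functions to subgroups of tuples in a relation, where
the subgroups are based on some attribute values. In these cases, we need to partition the
relation into nonoverlapping subsets (or groups) of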
tuples. Each group (partition) will consist of the tuples that have the same value of some
attribute(s), called the grouping attribute(s). We can then apply the function to each such group
independently to produce summary information about each group.
The GROUP BY clause specifies the grouping attributes, which should also appear in the
SELECT clause, so that the value resulting from applying each aggregate function to a group of
tuples appears along with the value of the grouping attribute(s).
The following statement illustrates the basic syntax of the GROUP BY clause:
SELECT
column_1, column_2, ..., aggregate_function(column_3)
FROM
table_name
GROUP BY
column_1, column_2, ...;
In this syntax:
First, select the columns that you want to group by (e.g., column_1 and column_2) and the column
to which you want to apply an aggregate function (column_3).
Second, list the columns that you want to group by in the GROUP BY clause.
The GROUP BY clause divides the rows by the values of the columns specified in it
and calculates a value for each group.
PostgreSQL evaluates the GROUP BY clause after the FROM and WHERE clauses and before
the HAVING, SELECT, DISTINCT, ORDER BY and LIMIT clauses:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
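The evaluation order can be imitated step by step in plain Python. This is only a conceptual sketch with made-up (customer_id, amount) rows, not a description of how PostgreSQL executes a query internally; each step is labeled with the clause it stands for.

```python
from collections import defaultdict

# Hypothetical (customer_id, amount) rows.
payments = [(1, 5), (1, 7), (2, 1), (2, 9), (3, 4), (3, 6), (3, 3)]

# FROM payment WHERE amount > 1: filter individual rows first.
rows = [(c, a) for c, a in payments if a > 1]

# GROUP BY customer_id: partition the surviving rows.
groups = defaultdict(list)
for c, a in rows:
    groups[c].append(a)

# HAVING COUNT(*) >= 2: filter whole groups.
kept = {c: amounts for c, amounts in groups.items() if len(amounts) >= 2}

# SELECT customer_id, SUM(amount) AS total: the alias exists only from here on.
selected = [(c, sum(amounts)) for c, amounts in kept.items()]

# ORDER BY total DESC LIMIT 1: sort, then cut.
result = sorted(selected, key=lambda r: r[1], reverse=True)[:1]
print(result)  # [(3, 13)]
```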
Example 1.
For each department, retrieve the department number, the number of employees in the
department, and their average salary.
SELECT Dno, COUNT (*), AVG (Salary)
FROM EMPLOYEE
GROUP BY Dno;
In Example 1, the EMPLOYEE tuples are partitioned into groups, each group having the same
value for the grouping attribute Dno. Hence, each group contains the employees who work in the
same department. The COUNT and AVG functions are applied to each such group of tuples.
Notice that the SELECT clause includes only the grouping attribute and the aggregate functions
to be applied on each group of tuples. Figure 1.1(a) illustrates how grouping works on Example 1;
it also shows the result of the query:
Figure 1.1(a)
If NULLs exist in the grouping attribute, then a separate group is created for all tuples with a
NULL value in the grouping attribute. For example, if the EMPLOYEE table had some tuples
that had NULL for the grouping attribute Dno, there would be a separate group for those tuples
in the result of Example 1.
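The separate group for NULL can be seen in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, with a reduced EMPLOYEE table (only Salary and Dno) and made-up rows, two of which have no department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Salary INTEGER, Dno INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [
        # Hypothetical rows; the last two have Dno = NULL.
        (30000, 5), (40000, 5), (38000, 5),
        (25000, 4), (43000, 4),
        (55000, 1),
        (20000, None), (30000, None),
    ],
)

# One result row per distinct Dno value, with NULL forming its own group.
groups = {
    dno: (n, avg)
    for dno, n, avg in conn.execute(
        "SELECT Dno, COUNT(*), AVG(Salary) FROM EMPLOYEE GROUP BY Dno"
    )
}
print(len(groups))   # 4
print(groups[5])     # (3, 36000.0)
print(groups[None])  # (2, 25000.0)
```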
Example 2.
For each project, retrieve the project number, the project name, and the number of employees
who work on that project:
SELECT Pnumber, Pname, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE Pnumber=Pno
GROUP BY Pnumber, Pname;
Example 2 shows how we can use a join condition in conjunction with GROUP BY. In this
case, the grouping and functions are applied after the joining of the two relations.
Sometimes we want to retrieve the values of these functions only for groups that satisfy certain
conditions. For example, suppose that we want to modify Example 2 so that only projects with
more than two employees appear in the result.
SQL provides a HAVING clause, which can appear in conjunction with a GROUP BY clause,
for this purpose. HAVING provides a condition on the summary information regarding the
group of tuples associated with each value of the grouping attributes. Only the groups that satisfy
the condition are retrieved in the result of the query.
Because the HAVING clause is evaluated before the SELECT clause, you cannot use column aliases
in the HAVING clause: when HAVING is evaluated, the column aliases
specified in the SELECT clause are not yet available.
The WHERE clause allows you to filter rows based on a specified condition. However,
the HAVING clause allows you to filter groups of rows according to a specified condition.
In other words, the WHERE clause is applied to rows while the HAVING clause is applied to
groups of rows.
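The difference between filtering rows and filtering groups can be sketched as follows. The snippet uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the payment rows and thresholds are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 5), (1, 7), (1, 1), (2, 9), (3, 1), (3, 1)],  # hypothetical rows
)

# WHERE drops individual rows (amount <= 2) before any grouping happens.
where_rows = conn.execute(
    "SELECT customer_id, COUNT(*) FROM payment "
    "WHERE amount > 2 GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# HAVING keeps every row, then drops whole groups with fewer than 2 rows.
having_rows = conn.execute(
    "SELECT customer_id, COUNT(*) FROM payment "
    "GROUP BY customer_id HAVING COUNT(*) >= 2 ORDER BY customer_id"
).fetchall()

print(where_rows)   # [(1, 2), (2, 1)]
print(having_rows)  # [(1, 3), (3, 2)]
```

With WHERE, customer 3 disappears because none of its rows pass the row filter; with HAVING, customer 2 disappears because its group is too small.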
Example 3. For each project on which more than two employees work, retrieve the project number,
the project name, and the number of employees who work on the project:
SELECT Pnumber, Pname, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE Pnumber=Pno
GROUP BY Pnumber, Pname
HAVING COUNT (*) > 2;
Notice that while selection conditions in the WHERE clause limit the tuples to which functions
are applied, the HAVING clause serves to choose whole groups.
Figure 1.1(b) illustrates the use of HAVING and displays the result of Example 3.
Figure 1.1(b)
Example 4. For each project, retrieve the project number, the project name, and the number of
employees from department 5 who work on the project:
Here we restrict the tuples in the relation (and hence the tuples in each group) to those that
satisfy the condition specified in the WHERE clause—namely, that they work in department
number 5.
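A query of the kind described in Example 4 might look like the sketch below, run through Python's built-in sqlite3 module as a stand-in for PostgreSQL. Pnumber, Pname, Pno and Dno come from these notes; the table names PROJECT and WORKS_ON, the key columns Ssn and Essn, and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (Ssn TEXT, Dno INTEGER);
CREATE TABLE PROJECT (Pnumber INTEGER, Pname TEXT);
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER);
-- Hypothetical data: e1 and e2 are in department 5, e3 is in department 4.
INSERT INTO EMPLOYEE VALUES ('e1', 5), ('e2', 5), ('e3', 4);
INSERT INTO PROJECT VALUES (1, 'ProductX'), (2, 'ProductY');
INSERT INTO WORKS_ON VALUES ('e1', 1), ('e2', 1), ('e3', 1), ('e1', 2), ('e3', 2);
""")

# The WHERE clause removes non-department-5 employees before grouping,
# so COUNT(*) only sees department-5 tuples in each project group.
rows = conn.execute("""
    SELECT Pnumber, Pname, COUNT(*)
    FROM PROJECT, WORKS_ON, EMPLOYEE
    WHERE Pnumber = Pno AND Ssn = Essn AND Dno = 5
    GROUP BY Pnumber, Pname
    ORDER BY Pnumber
""").fetchall()
print(rows)  # [(1, 'ProductX', 2), (2, 'ProductY', 1)]
```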
Notice that we must be extra careful when two different conditions apply (one to the aggregate
function in the SELECT clause and another to the function in the HAVING clause).
For example, suppose that we want to count the total number of employees whose salaries exceed
$40,000 in each department, but only for departments where more than five employees work. Here,
the condition (SALARY > 40000) applies only to the COUNT function in the SELECT clause.
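Suppose that we write the following incorrect query:
SELECT Dname, COUNT (*)
FROM DEPARTMENT, EMPLOYEE
WHERE Dnumber=Dno AND Salary>40000
GROUP BY Dname
HAVING COUNT (*) > 5;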
This is incorrect because it will select only departments that have more than five employees who
each earn more than $40,000. The rule is that the WHERE clause is executed first, to select
individual tuples or joined tuples; the HAVING clause is applied later, to select individual
groups of tuples. Hence, the tuples are already restricted to employees who earn more than
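$40,000 before the function in the HAVING clause is applied. One way to write this query
correctly is to use a nested query: the outer query counts, per department, the employees who
earn more than $40,000, and a subquery with GROUP BY and HAVING restricts it to the departments
that have more than five employees.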
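The difference between the two formulations can be checked on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and, to stay short, groups by Dno on a reduced EMPLOYEE table instead of joining to a department table; the rows are made up, and the nested query is one possible formulation, not necessarily the one intended in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Salary INTEGER, Dno INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [
        # Department 5: six employees, two of them above 40000.
        (30000, 5), (32000, 5), (35000, 5), (38000, 5), (45000, 5), (50000, 5),
        # Department 4: two employees, both above 40000.
        (43000, 4), (60000, 4),
    ],
)

# Incorrect: WHERE runs first, so HAVING only counts the high earners.
incorrect = conn.execute("""
    SELECT Dno, COUNT(*) FROM EMPLOYEE
    WHERE Salary > 40000
    GROUP BY Dno
    HAVING COUNT(*) > 5
""").fetchall()

# Nested query: the subquery picks departments by total headcount,
# and the outer query counts only the high earners within them.
correct = conn.execute("""
    SELECT Dno, COUNT(*) FROM EMPLOYEE
    WHERE Salary > 40000
      AND Dno IN (SELECT Dno FROM EMPLOYEE GROUP BY Dno HAVING COUNT(*) > 5)
    GROUP BY Dno
""").fetchall()

print(incorrect)  # []
print(correct)    # [(5, 2)]
```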
Example 1.
To get the number of payments made by each customer, you use the GROUP BY clause to group the
payments by customer id, and the COUNT() function to count the payments in each group.
The following query illustrates the idea and sorts the result by customer id:
SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id
ORDER BY customer_id;
Example 2.
The following statement finds customers who have made more than 40 payments:
SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id
HAVING
COUNT (customer_id) > 40;
Example 3.
The following example uses the AVG() function with GROUP BY clause to calculate the average
amount paid by each customer:
SELECT
customer_id,
first_name,
last_name,
AVG(amount)::NUMERIC(10,2)
FROM
payment
INNER JOIN customer USING(customer_id)
GROUP BY customer_id
ORDER BY customer_id;
Example 4.
You can use the AVG function in the HAVING clause to filter groups based on a condition.
For example, the following query returns the customers whose average payment is greater
than 5 USD:
SELECT
customer_id,
first_name,
last_name,
AVG (amount)::NUMERIC(10,2)
FROM
payment
INNER JOIN customer USING(customer_id)
GROUP BY
customer_id
HAVING
AVG (amount) > 5
ORDER BY
customer_id;
Example 5.
The following example uses the SUM() function with the GROUP BY clause to calculate the total
amount paid by each customer:
SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
ORDER BY total;
Example 6.
The following query returns the top five customers who paid the most:
SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
ORDER BY total DESC
LIMIT 5;
Example 7.
The following example returns the customers who paid more than $200 in total:
SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
HAVING SUM(amount) > 200
ORDER BY total DESC;
Example 8.
For each group of (customer_id, staff_id), the SUM() calculates the total amount of money:
SELECT
customer_id, staff_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
ORDER BY customer_id;
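Grouping by two columns produces one row per distinct (customer_id, staff_id) pair. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL; the payment rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment (customer_id INTEGER, staff_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?)",
    # Hypothetical rows: customer 1 paid to staff 1 and 2, customer 2 only to staff 2.
    [(1, 1, 5), (1, 1, 3), (1, 2, 4), (2, 2, 6), (2, 2, 1)],
)

# The order of columns in GROUP BY does not change which groups are formed.
rows = conn.execute("""
    SELECT customer_id, staff_id, SUM(amount)
    FROM payment
    GROUP BY staff_id, customer_id
    ORDER BY customer_id, staff_id
""").fetchall()
print(rows)  # [(1, 1, 8), (1, 2, 4), (2, 2, 7)]
```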
Summary of SQL Queries.
The FROM clause specifies all relations (tables) needed in the query, including joined relations,
but not those in nested queries.
The WHERE clause specifies the conditions for selecting the tuples from these relations,
including join conditions if needed.
GROUP BY specifies grouping attributes, whereas HAVING specifies a condition on the groups
being selected rather than on the individual tuples.
Because the HAVING clause is evaluated before the SELECT clause, you cannot use column
aliases in the HAVING clause: when HAVING is evaluated, the column
aliases specified in the SELECT clause are not yet available.
The WHERE clause allows you to filter rows based on a specified condition. However,
the HAVING clause allows you to filter groups of rows according to a specified condition. In
other words, the WHERE clause is applied to rows while the HAVING clause is applied to groups
of rows.
The built-in aggregate functions COUNT, SUM, MIN, MAX, and AVG are used in conjunction
with grouping, but they can also be applied to all the selected tuples in a query without a GROUP
BY clause.
You can use aggregate functions as expressions in the SELECT and HAVING clauses (and in
ORDER BY, as in ORDER BY SUM(amount)), but not in the WHERE clause.
In order to formulate queries correctly, it is useful to consider the steps that define the meaning or
semantics of each query. A query is evaluated conceptually by first applying the FROM clause (to
identify all tables involved in the query or to materialize any joined tables), followed by the
WHERE clause to select and join tuples, and then by GROUP BY and HAVING.
Conceptually, ORDER BY is applied at the end to sort the query result. If none of the last three
clauses (GROUP BY, HAVING, and ORDER BY) are specified, we can think conceptually of a
query as being executed as follows: For each combination of tuples—one from each of the
relations specified in the FROM clause—evaluate the WHERE clause; if it evaluates to TRUE,
place the values of the attributes specified in the SELECT clause from this tuple combination in
the result of the query.
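The conceptual evaluation of a query without GROUP BY, HAVING, or ORDER BY can be written as a plain loop. The Python sketch below is only a model of the semantics, not of how a database actually executes the query; the two relations, their attribute names (Lname, Dnumber), and their rows are made up for illustration.

```python
from itertools import product

# Hypothetical relations as lists of dicts.
EMPLOYEE = [
    {"Lname": "Smith", "Dno": 5},
    {"Lname": "Wallace", "Dno": 4},
]
DEPARTMENT = [
    {"Dname": "Research", "Dnumber": 5},
    {"Dname": "Administration", "Dnumber": 4},
]

result = []
# FROM: every combination of one tuple from each relation.
for e, d in product(EMPLOYEE, DEPARTMENT):
    # WHERE: keep the combination only if the condition evaluates to TRUE.
    if e["Dno"] == d["Dnumber"] and d["Dname"] == "Research":
        # SELECT: place the requested attribute values in the result.
        result.append((e["Lname"], d["Dname"]))

print(result)  # [('Smith', 'Research')]
```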