Lecture Week6

The document discusses grouping and aggregate functions in SQL. It provides examples of how to use various aggregate functions like COUNT, SUM, MAX, MIN, and AVG to summarize and retrieve aggregate values from data that has been grouped. It explains that the GROUP BY clause is used to divide rows into groups based on one or more columns before applying aggregate functions to each group. Built-in aggregate functions and examples of retrieving summary statistics like counts, sums, maximums, minimums and averages from single relations or subsets defined by filters are provided.

Uploaded by

A Dan
LECTURE WEEK6

GROUPING DATA. AGGREGATE FUNCTIONS.

Grouping data.
Aggregate Functions in SQL.

Grouping is used to create subgroups of tuples before summarization. Grouping and aggregation
are required in many database applications, and we will introduce their use in SQL through
examples.

Aggregate functions are used to summarize information from multiple tuples into a single-tuple
summary. In other words, aggregate functions perform a calculation on a set of rows and return a
single row.

A number of built-in aggregate functions exist:

- COUNT – returns the number of values

- SUM – returns the sum of values

- MAX – returns the maximum value

- MIN – returns the minimum value

- AVG – returns the average (mean) of values

We often use the aggregate functions with the GROUP BY clause in the SELECT statement. In
these cases, the GROUP BY clause divides the result set into groups of rows and the aggregate
functions perform a calculation on each group e.g., maximum, minimum, average, etc.

These functions can be used in the SELECT clause or in a HAVING clause (which we introduce
later).

The functions MAX and MIN can also be used with attributes that have nonnumeric domains if
the domain values have a total ordering among one another (for example, with dates).
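As a minimal sketch of MIN and MAX on a nonnumeric domain, the snippet below applies them to dates. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for PostgreSQL, and the table and its rows are made up for illustration. Dates are stored as ISO-8601 strings, whose text ordering matches chronological ordering.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, bdate TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ann", "1965-01-09"), ("Bob", "1955-12-08"), ("Cy", "1972-07-31")],
)

# ISO-8601 date strings are totally ordered, so MIN/MAX return the
# earliest and latest birth dates.
earliest, latest = conn.execute(
    "SELECT MIN(bdate), MAX(bdate) FROM employee"
).fetchone()
print(earliest, latest)
```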

COMPANY DB structure

Example 1.

Find the sum of the salaries of all employees, the maximum salary, the minimum salary, and the
average salary:

SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)
FROM EMPLOYEE;

If we want to get the preceding function values for employees of a specific department, say the
‘Research’ department, we can write a query in which the EMPLOYEE tuples are restricted by the
WHERE clause to those employees who work for the ‘Research’ department.

Example 2.

Find the sum of the salaries of all employees of the ‘Research’ department, as well as the
maximum salary, the minimum salary, and the average salary in this department:

SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)
FROM EMPLOYEE JOIN DEPARTMENT ON Dno=Dnumber
WHERE Dname='Research';
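The two summary queries can be tried out with a small sketch like the one below. It assumes SQLite (via Python's sqlite3 module) rather than PostgreSQL, and the EMPLOYEE and DEPARTMENT rows are invented sample data with only the columns the queries need.

```python
import sqlite3

# Hypothetical, cut-down COMPANY tables; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (Dname TEXT, Dnumber INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Salary INTEGER, Dno INTEGER);
INSERT INTO DEPARTMENT VALUES
  ('Research', 5), ('Administration', 4), ('Headquarters', 1);
INSERT INTO EMPLOYEE VALUES
  ('1', 30000, 5), ('2', 40000, 5), ('3', 38000, 5), ('4', 25000, 5),
  ('5', 43000, 4), ('6', 25000, 4), ('7', 25000, 4), ('8', 55000, 1);
""")

AGGREGATES = "SUM(Salary), MAX(Salary), MIN(Salary), AVG(Salary)"

# Summary over the whole relation.
all_employees = conn.execute(f"SELECT {AGGREGATES} FROM EMPLOYEE").fetchone()

# Same summary, restricted by WHERE to one department.
research_only = conn.execute(
    f"SELECT {AGGREGATES} FROM EMPLOYEE JOIN DEPARTMENT ON Dno = Dnumber "
    "WHERE Dname = 'Research'"
).fetchone()

print(all_employees)
print(research_only)
```

Each query returns exactly one row, because no GROUP BY clause is present.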

Example 3.

Retrieve the total number of employees in the company:

SELECT COUNT (*)
FROM EMPLOYEE;

Here the asterisk (*) refers to the rows (tuples), so COUNT (*) returns the number of rows in the
result of the query. We may also use the COUNT function to count values in a column rather
than tuples, as in the next example.

Example 4.

Count the number of distinct salary values in the database:

SELECT COUNT (DISTINCT Salary)
FROM EMPLOYEE;

The preceding examples summarize a whole relation (Example 1, Example 3, Example 4) or a
selected subset of tuples (Example 2), and hence all produce single tuples or single values. They
illustrate how functions are applied to retrieve a summary value or summary tuple from the
database.

Aggregate functions: COUNT()


The COUNT() function is an aggregate function that allows you to get the number of rows that
match a specific condition of a query.

The following sections illustrate the various forms of the COUNT() function.

COUNT(*)

The COUNT(*) function returns the number of rows returned by a SELECT statement, including
NULL and duplicates:

SELECT
COUNT(*)
FROM
table_name
WHERE
condition;

When you apply the COUNT(*) function to the entire table, PostgreSQL has to scan the whole
table sequentially. If you use the COUNT(*) function on a big table, the query will be slow.

COUNT(column)

Similar to the COUNT(*) function, the COUNT(column) function counts the rows returned by
a SELECT statement. However, it counts only the rows in which the column value is not NULL:

SELECT
COUNT(column)
FROM
table_name
WHERE
condition;

COUNT(DISTINCT column)

In this form, the COUNT(DISTINCT column) returns the number of unique non-null values in
the column.

SELECT
COUNT(DISTINCT column)
FROM
table_name
WHERE
condition;

We often use the COUNT() function with the GROUP BY clause to return the number of items
for each group. For example, we can use the COUNT() with the GROUP BY clause to return the
number of films in each film category.
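The three forms of COUNT() differ only in how they treat NULLs and duplicates. The sketch below makes this visible on a tiny table; it assumes SQLite (via Python's sqlite3 module) in place of PostgreSQL, and the payment rows are invented.

```python
import sqlite3

# Hypothetical payment table with two NULL amounts and one duplicate amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 4.99), (2, 4.99), (3, None), (4, 0.99), (5, None)],
)

count_rows, count_amount, count_distinct = conn.execute(
    "SELECT COUNT(*), COUNT(amount), COUNT(DISTINCT amount) FROM payment"
).fetchone()

# COUNT(*) counts every row, COUNT(amount) skips NULLs,
# COUNT(DISTINCT amount) skips NULLs and counts duplicates once.
print(count_rows, count_amount, count_distinct)  # 5 3 2
```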

PostgreSQL COUNT() function examples

1) PostgreSQL COUNT(*) example

The following statement uses the COUNT(*) function to return the number of transactions in
the payment table:

SELECT
COUNT(*)
FROM
payment;

2) PostgreSQL COUNT(DISTINCT column) example

To get the number of distinct amounts that customers paid, you use the COUNT(DISTINCT amount)
function as shown in the following example:

SELECT
COUNT(DISTINCT amount)
FROM
payment;

PostgreSQL COUNT() with GROUP BY clause

To get the number of payments per customer, you use the GROUP BY clause to group the
payments by customer id, and use the COUNT() function to count the payments in each group.

The following query illustrates the idea:

SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id;
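A small runnable sketch of this per-customer count follows. It assumes SQLite (via Python's sqlite3 module) instead of PostgreSQL and uses made-up payment rows; an ORDER BY is added only to make the printed order deterministic.

```python
import sqlite3

# Hypothetical payment rows: customer 1 paid twice, 2 once, 3 three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 2.99), (1, 4.99), (2, 0.99), (3, 5.99), (3, 1.99), (3, 3.99)],
)

# One output row per customer_id group, with the size of that group.
payments_per_customer = conn.execute(
    "SELECT customer_id, COUNT(customer_id) FROM payment "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(payments_per_customer)  # [(1, 2), (2, 1), (3, 3)]
```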

Aggregate functions: AVG()


The AVG() function is one of the most commonly used aggregate functions in PostgreSQL.
The AVG() function allows you to calculate the average value of a set.

If you want to know the average amount that customers paid, you can apply the AVG function to
the amount column, as in the following query:

SELECT AVG(amount)
FROM payment;

To make the output more readable:

SELECT ROUND(AVG(amount))
FROM payment;

Or

SELECT AVG(amount)::numeric(8,2)
FROM payment;
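The effect of rounding an average can be checked with the sketch below. It assumes SQLite (via Python's sqlite3 module) as a stand-in; the `::numeric(8,2)` cast is PostgreSQL syntax, so this sketch uses the two-argument ROUND function for the two-decimal version instead. The amounts are invented.

```python
import sqlite3

# Hypothetical payment amounts whose mean is 4.24.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?)", [(2.99,), (4.99,), (0.99,), (7.99,)]
)

# ROUND(x) rounds to a whole number; ROUND(x, 2) keeps two decimal places.
rounded_whole, rounded_2dp = conn.execute(
    "SELECT ROUND(AVG(amount)), ROUND(AVG(amount), 2) FROM payment"
).fetchone()
print(rounded_whole, rounded_2dp)
```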

Aggregate functions: MAX()


The PostgreSQL MAX function is an aggregate function that returns the maximum value in a set of
values. The MAX function is useful in many cases. For example, you can use the MAX function to
find the employees who have the highest salary or to find the most expensive products.

The following query uses the MAX() function to find the highest amount paid by customers in
the payment table:

SELECT MAX(amount)
FROM payment;

To get information about the customer who made the highest payment, you use a subquery as follows:

SELECT first_name, last_name, amount
FROM customer c JOIN payment p ON c.customer_id = p.customer_id
WHERE amount = (SELECT MAX(amount) FROM payment);

PostgreSQL first evaluates the subquery to find the highest amount, and then the outer query
returns the customers whose payment equals that amount.

Aggregate functions: MIN()


The PostgreSQL MIN() function is an aggregate function that returns the minimum value in a set of
values.

To find the minimum value in a column of a table, you pass the name of the column to
the MIN() function. The data type of the column can be number, string, or any comparable type.

The following example uses the MIN() function to get the lowest rental rate from
the rental_rate column of the film table:

SELECT
MIN (rental_rate)
FROM
film;

To get the films that have the lowest rental rate, you use the following query:

SELECT
film_id,
title,
rental_rate
FROM
film
WHERE
rental_rate = (
SELECT MIN(rental_rate)
FROM film
);
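A subquery of this kind can return several rows when more than one row shares the extreme value. The sketch below shows this on a made-up film table, assuming SQLite (via Python's sqlite3 module) in place of PostgreSQL.

```python
import sqlite3

# Hypothetical film table in which two films share the lowest rental rate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, title TEXT, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [(1, "A", 0.99), (2, "B", 4.99), (3, "C", 0.99), (4, "D", 2.99)],
)

# The inner query yields a single value; the outer query keeps every
# film whose rental_rate equals that value.
cheapest_films = conn.execute(
    "SELECT film_id, title, rental_rate FROM film "
    "WHERE rental_rate = (SELECT MIN(rental_rate) FROM film) "
    "ORDER BY film_id"
).fetchall()
print(cheapest_films)  # [(1, 'A', 0.99), (3, 'C', 0.99)]
```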

Aggregate functions: SUM()


The PostgreSQL SUM() is an aggregate function that returns the sum of values or distinct values.

The SUM() function ignores NULL values: they are not included in the calculation.

If you use the DISTINCT option, the SUM() function calculates the sum of distinct values only.

For example, without the DISTINCT option, the SUM() of 1, 1, 8, and 2 returns 12. With
the DISTINCT option, the SUM() of 1, 1, 8, and 2 returns 11 (1 + 8 + 2), because the duplicate
value (1) is counted only once.

If you use the SUM() function in a SELECT statement that returns no rows, it returns NULL,
not zero.
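These three properties of SUM() can be verified with a short sketch. It assumes SQLite (via Python's sqlite3 module) as a stand-in for PostgreSQL; the table `t` and its column `v` are hypothetical names chosen for the illustration.

```python
import sqlite3

# Hypothetical one-column table holding 1, 1, 8, 2 and a NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (8,), (2,), (None,)])

# The NULL is ignored; DISTINCT counts the duplicate 1 only once.
total, distinct_total = conn.execute(
    "SELECT SUM(v), SUM(DISTINCT v) FROM t"
).fetchone()

# No row satisfies the condition, so SUM returns NULL (None in Python).
empty_total = conn.execute("SELECT SUM(v) FROM t WHERE v > 100").fetchone()[0]

print(total, distinct_total, empty_total)  # 12 11 None
```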

The following statement uses the SUM() function to calculate the total payments of the customer
with id 2000:

SELECT SUM (amount) AS total
FROM payment
WHERE customer_id = 2000;

Grouping: The GROUP BY and HAVING Clauses


In many cases we want to apply the aggregate functions to subgroups of tuples in a relation,
where the subgroups are based on some attribute values. For example, we may want to find the
average salary of employees in each department or the number of employees who work on each
project.

In these cases, we need to partition the relation into nonoverlapping subsets (or groups) of
tuples. Each group (partition) will consist of the tuples that have the same value of some
attribute(s), called the grouping attribute(s). We can then apply the function to each such group
independently to produce summary information about each group.

PostgreSQL has a GROUP BY clause for this purpose.

The GROUP BY clause specifies the grouping attributes, which should also appear in the
SELECT clause, so that the value resulting from applying each aggregate function to a group of
tuples appears along with the value of the grouping attribute(s).

The following statement illustrates the basic syntax of the GROUP BY clause:

SELECT
column_1, column_2, ..., aggregate_function(column_3)
FROM
table_name
GROUP BY
column_1, column_2, ...;

In this syntax:

- First, select the columns that you want to group by (e.g., column_1 and column_2) and the
column to which you want to apply an aggregate function (column_3).
- Second, list the columns that you want to group by in the GROUP BY clause.

The GROUP BY clause divides the rows by the values of the columns specified in it and
calculates a value for each group.

PostgreSQL evaluates the GROUP BY clause after the FROM and WHERE clauses and before
the HAVING, SELECT, DISTINCT, ORDER BY and LIMIT clauses. The evaluation order is:
FROM, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT.

Example 1.

For each department, retrieve the department number, the number of employees in the
department, and their average salary.

SELECT Dno, COUNT (*), AVG (Salary)
FROM EMPLOYEE
GROUP BY Dno;

In Example 1, the EMPLOYEE tuples are partitioned into groups, each group having the same
value for the grouping attribute Dno. Hence, each group contains the employees who work in the
same department. The COUNT and AVG functions are applied to each such group of tuples.
Notice that the SELECT clause includes only the grouping attribute and the aggregate functions
to be applied on each group of tuples. Figure 1.1(a) illustrates how grouping works on Example 1;
it also shows the result of the query:

Figure 1.1(a)

If NULLs exist in the grouping attribute, then a separate group is created for all tuples with a
NULL value in the grouping attribute. For example, if the EMPLOYEE table had some tuples
that had NULL for the grouping attribute Dno, there would be a separate group for those tuples
in the result of Example 1.
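The separate group for NULL can be observed in the sketch below. It assumes SQLite (via Python's sqlite3 module) in place of PostgreSQL, and the EMPLOYEE rows are invented; two of them have no department.

```python
import sqlite3

# Hypothetical EMPLOYEE rows; two employees have NULL in Dno.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Salary INTEGER, Dno INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [(30000, 5), (40000, 5), (25000, 4), (20000, None), (22000, None)],
)

rows = conn.execute(
    "SELECT Dno, COUNT(*), AVG(Salary) FROM EMPLOYEE GROUP BY Dno"
).fetchall()

# Map each grouping value (including None for NULL) to its summary.
groups = {dno: (count, avg) for dno, count, avg in rows}
print(groups)
```

The result has three groups: one for department 5, one for department 4, and one keyed by None that collects the tuples whose Dno is NULL.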

Example 2.

For each project, retrieve the project number, the project name, and the number of employees
who work on that project:

SELECT Pnumber, Pname, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE Pnumber=Pno
GROUP BY Pnumber, Pname;

Example 2 shows how we can use a join condition in conjunction with GROUP BY. In this
case, the grouping and functions are applied after the joining of the two relations.

Sometimes we want to retrieve the values of these functions only for groups that satisfy certain
conditions. For example, suppose that we want to modify Example 2 so that only projects with
more than two employees appear in the result.

SQL provides a HAVING clause, which can appear in conjunction with a GROUP BY clause,
for this purpose. HAVING provides a condition on the summary information regarding the
group of tuples associated with each value of the grouping attributes. Only the groups that satisfy
the condition are retrieved in the result of the query.

Since the HAVING clause is evaluated before the SELECT clause, you cannot use column aliases
in the HAVING clause: at the time the HAVING clause is evaluated, the column aliases
specified in the SELECT clause are not yet available.

HAVING vs. WHERE

The WHERE clause allows you to filter rows based on a specified condition. However,
the HAVING clause allows you to filter groups of rows according to a specified condition.

In other words, the WHERE clause is applied to rows while the HAVING clause is applied to
groups of rows.

Example 3.

For each project on which more than two employees work, retrieve the project number,
the project name, and the number of employees who work on the project:

SELECT Pnumber, Pname, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE Pnumber=Pno
GROUP BY Pnumber, Pname
HAVING COUNT (*) > 2;

Notice that while selection conditions in the WHERE clause limit the tuples to which functions
are applied, the HAVING clause serves to choose whole groups.
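To see HAVING select whole groups, the sketch below runs the same grouped join with and without the HAVING condition. It assumes SQLite (via Python's sqlite3 module) instead of PostgreSQL, and the PROJECT and WORKS_ON rows are made-up sample data.

```python
import sqlite3

# Hypothetical PROJECT and WORKS_ON contents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJECT (Pnumber INTEGER PRIMARY KEY, Pname TEXT);
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER);
INSERT INTO PROJECT VALUES
  (1, 'ProductX'), (2, 'ProductY'), (3, 'ProductZ'), (10, 'Computerization');
INSERT INTO WORKS_ON VALUES
  ('e1', 1), ('e2', 1),
  ('e1', 2), ('e3', 2), ('e4', 2),
  ('e3', 3), ('e5', 3),
  ('e4', 10), ('e6', 10), ('e7', 10);
""")

GROUPED = (
    "SELECT Pnumber, Pname, COUNT(*) FROM PROJECT, WORKS_ON "
    "WHERE Pnumber = Pno GROUP BY Pnumber, Pname"
)

# Every project with its number of employees.
all_projects = conn.execute(GROUPED + " ORDER BY Pnumber").fetchall()

# HAVING drops the groups that have two employees or fewer.
busy_projects = conn.execute(
    GROUPED + " HAVING COUNT(*) > 2 ORDER BY Pnumber"
).fetchall()

print(all_projects)
print(busy_projects)  # [(2, 'ProductY', 3), (10, 'Computerization', 3)]
```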

Figure 1.1(b) illustrates the use of HAVING and displays the result of Example 3.

Figure 1.1(b)

Example 4.

For each project, retrieve the project number, the project name, and the number of
employees from department 5 who work on the project:

SELECT Pnumber, Pname, COUNT (*)
FROM PROJECT, WORKS_ON, EMPLOYEE
WHERE Pnumber=Pno AND Ssn=Essn AND Dno=5
GROUP BY Pnumber, Pname;

Here we restrict the tuples in the relation (and hence the tuples in each group) to those that
satisfy the condition specified in the WHERE clause—namely, that they work in department
number 5.

Notice that we must be extra careful when two different conditions apply (one to the aggregate
function in the SELECT clause and another to the function in the HAVING clause).
For example, suppose that we want to count the total number of employees whose salaries exceed
$40,000 in each department, but only for departments where more than five employees work. Here,
the condition (Salary > 40000) applies only to the COUNT function in the SELECT clause.

Suppose that we write the following query:

SELECT Dname, COUNT (*)
FROM DEPARTMENT, EMPLOYEE
WHERE Dnumber=Dno AND Salary>40000
GROUP BY Dname
HAVING COUNT (*) > 5;

This is incorrect because it will select only departments that have more than five employees who
each earn more than $40,000. The rule is that the WHERE clause is executed first, to select
individual tuples or joined tuples; the HAVING clause is applied later, to select individual
groups of tuples. Hence, the tuples are already restricted to employees who earn more than
$40,000 before the function in the HAVING clause is applied. One way to write this query
correctly is to use a nested query, as shown in Example 5:

SELECT Dnumber, COUNT (*)
FROM DEPARTMENT, EMPLOYEE
WHERE Dnumber=Dno AND Salary>40000 AND
      Dno IN (SELECT Dno
              FROM EMPLOYEE
              GROUP BY Dno
              HAVING COUNT (*) > 5)
GROUP BY Dnumber;
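The difference between the two formulations shows up clearly on a small data set. The sketch below assumes SQLite (via Python's sqlite3 module) in place of PostgreSQL and uses invented rows: department 5 has six employees, two of whom earn more than 40000, and department 4 has only two employees. The nested version is written with `Dno IN (...)` and a final GROUP BY, which is one runnable way to express the intended query.

```python
import sqlite3

# Hypothetical data: dept 5 has six employees, dept 4 has two.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (Dname TEXT, Dnumber INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Salary INTEGER, Dno INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
INSERT INTO EMPLOYEE VALUES
  ('1', 30000, 5), ('2', 45000, 5), ('3', 38000, 5),
  ('4', 25000, 5), ('5', 50000, 5), ('6', 28000, 5),
  ('7', 43000, 4), ('8', 55000, 4);
""")

# WHERE runs first, so HAVING only sees the high earners of each department.
where_then_having = conn.execute("""
    SELECT Dname, COUNT(*)
    FROM DEPARTMENT, EMPLOYEE
    WHERE Dnumber = Dno AND Salary > 40000
    GROUP BY Dname
    HAVING COUNT(*) > 5
""").fetchall()

# The subquery finds departments with more than five employees overall;
# the outer query then counts the high earners in those departments.
nested = conn.execute("""
    SELECT Dnumber, COUNT(*)
    FROM DEPARTMENT, EMPLOYEE
    WHERE Dnumber = Dno AND Salary > 40000 AND
          Dno IN (SELECT Dno FROM EMPLOYEE GROUP BY Dno HAVING COUNT(*) > 5)
    GROUP BY Dnumber
""").fetchall()

print(where_then_having)  # []
print(nested)             # [(5, 2)]
```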

GROUP BY and HAVING examples:


Example 1.

To get the number of payments per customer, you use the GROUP BY clause to group the
payments by customer id, and use the COUNT() function to count the payments in each group.

The following query illustrates the idea:

SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id
ORDER BY
customer_id;

Example 2.

The following statement finds customers who have made more than 40 payments:

SELECT
customer_id,
COUNT (customer_id)
FROM
payment
GROUP BY
customer_id
HAVING
COUNT (customer_id) > 40;

Example 3.

The following example uses the AVG() function with GROUP BY clause to calculate the average
amount paid by each customer:

SELECT
customer_id,
first_name,
last_name,
AVG(amount)::NUMERIC(10,2)
FROM
payment
INNER JOIN customer USING(customer_id)
GROUP BY customer_id, first_name, last_name
ORDER BY customer_id;

Example 4.

You can use the AVG() function in the HAVING clause to filter groups based on a certain
condition. For example, you can get the customers whose average payment is greater than
5 USD. The following query does so:

SELECT
customer_id,
first_name,
last_name,
AVG (amount)::NUMERIC(10,2)
FROM
payment
INNER JOIN customer USING(customer_id)
GROUP BY
customer_id, first_name, last_name
HAVING
AVG (amount) > 5
ORDER BY
customer_id;
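A runnable sketch of filtering on an average follows. It assumes SQLite (via Python's sqlite3 module) as a stand-in for PostgreSQL, invented customer and payment rows, and ROUND(..., 2) in place of the PostgreSQL-specific `::NUMERIC(10,2)` cast. It groups by every non-aggregated column in the select list, which is the portable form.

```python
import sqlite3

# Hypothetical customers and payments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       first_name TEXT, last_name TEXT);
CREATE TABLE payment (customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
INSERT INTO payment VALUES (1, 4), (1, 8), (2, 2), (2, 4), (3, 10);
""")

# Average payment per customer, keeping only averages above 5.
big_spenders = conn.execute("""
    SELECT customer_id, first_name, last_name, ROUND(AVG(amount), 2)
    FROM payment
    INNER JOIN customer USING (customer_id)
    GROUP BY customer_id, first_name, last_name
    HAVING AVG(amount) > 5
    ORDER BY customer_id
""").fetchall()
print(big_spenders)  # [(1, 'Ann', 'Lee', 6.0), (3, 'Cy', 'Fox', 10.0)]
```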

Example 5.

The following example uses the SUM() function with the GROUP BY clause to calculate the total
amount paid by each customer:

SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
ORDER BY total;

Example 6.

The following query returns the top five customers who paid the most:

SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
ORDER BY total DESC
LIMIT 5;
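The combination of GROUP BY, ORDER BY ... DESC and LIMIT can be tried with the sketch below. It assumes SQLite (via Python's sqlite3 module) instead of PostgreSQL; the payment rows are invented and use whole-number amounts so that the totals are exact and distinct.

```python
import sqlite3

# Hypothetical payments for seven customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 50), (3, 5), (3, 5), (4, 70),
     (5, 40), (6, 15), (6, 5), (7, 60)],
)

# Sum per customer, sort the groups by that sum, keep the first five.
top_five = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM payment
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 5
""").fetchall()
print(top_five)  # [(4, 70), (7, 60), (2, 50), (5, 40), (1, 30)]
```

The alias total can be used in ORDER BY because ORDER BY is evaluated after the SELECT clause.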

Example 7.

The following example returns the customers whose total payments exceed $200:

SELECT
customer_id,
SUM (amount) AS total
FROM
payment
GROUP BY
customer_id
HAVING SUM(amount) > 200
ORDER BY total DESC;

Example 8.

The following example uses multiple columns in the GROUP BY clause.

For each group of (customer_id, staff_id), the SUM() calculates the total amount of money:

SELECT
customer_id, staff_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
ORDER BY customer_id;
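Grouping on two columns produces one row per distinct pair of values, as the sketch below shows. It assumes SQLite (via Python's sqlite3 module) in place of PostgreSQL and made-up payment rows; staff_id is added to the ORDER BY only to make the output order fully deterministic.

```python
import sqlite3

# Hypothetical payments handled by two staff members for two customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment (customer_id INTEGER, staff_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 5), (1, 2, 7), (2, 1, 3), (2, 2, 4), (2, 2, 6)],
)

# One group, and therefore one total, per (customer_id, staff_id) pair.
totals = conn.execute("""
    SELECT customer_id, staff_id, SUM(amount)
    FROM payment
    GROUP BY staff_id, customer_id
    ORDER BY customer_id, staff_id
""").fetchall()
print(totals)  # [(1, 1, 15), (1, 2, 7), (2, 1, 3), (2, 2, 10)]
```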

Summary of SQL Queries.

A retrieval query in SQL can consist of up to six clauses:

SELECT <attribute and function list>
FROM <table list>
[ WHERE <condition> ]
[ GROUP BY <grouping attribute(s)> ]
[ HAVING <group condition> ]
[ ORDER BY <attribute list> ];

The SELECT clause lists the attributes or functions to be retrieved.

The FROM clause specifies all relations (tables) needed in the query, including joined relations,
but not those in nested queries.

The WHERE clause specifies the conditions for selecting the tuples from these relations,
including join conditions if needed.

GROUP BY specifies grouping attributes, whereas HAVING specifies a condition on the groups
being selected rather than on the individual tuples.

Since the HAVING clause is evaluated before the SELECT clause, you cannot use column
aliases in the HAVING clause: at the time the HAVING clause is evaluated, the column
aliases specified in the SELECT clause are not yet available.

The WHERE clause allows you to filter rows based on a specified condition. However,
the HAVING clause allows you to filter groups of rows according to a specified condition. In
other words, the WHERE clause is applied to rows while the HAVING clause is applied to groups
of rows.

The built-in aggregate functions COUNT, SUM, MIN, MAX, and AVG are used in conjunction
with grouping, but they can also be applied to all the selected tuples in a query without a GROUP
BY clause.

You can use aggregate functions as expressions only in the following clauses: SELECT and
HAVING.

Finally, ORDER BY specifies an order for displaying the result of a query.

In order to formulate queries correctly, it is useful to consider the steps that define the meaning or
semantics of each query. A query is evaluated conceptually by first applying the FROM clause (to
identify all tables involved in the query or to materialize any joined tables), followed by the
WHERE clause to select and join tuples, and then by GROUP BY and HAVING.

Conceptually, ORDER BY is applied at the end to sort the query result. If none of the last three
clauses (GROUP BY, HAVING, and ORDER BY) are specified, we can think conceptually of a
query as being executed as follows: For each combination of tuples—one from each of the
relations specified in the FROM clause—evaluate the WHERE clause; if it evaluates to TRUE,
place the values of the attributes specified in the SELECT clause from this tuple combination in
the result of the query.
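The conceptual evaluation just described can be written out directly. The sketch below is plain Python rather than SQL and is only an illustration of the semantics, not of how a database engine actually executes a query; the function name `evaluate` and the sample relations are hypothetical.

```python
from itertools import product

# Hypothetical relations, each a list of tuples represented as dicts.
EMPLOYEE = [
    {"Lname": "Smith", "Dno": 5},
    {"Lname": "Wong", "Dno": 5},
    {"Lname": "Zelaya", "Dno": 4},
]
DEPARTMENT = [
    {"Dname": "Research", "Dnumber": 5},
    {"Dname": "Administration", "Dnumber": 4},
]


def evaluate(tables, where, select):
    """Conceptually evaluate SELECT ... FROM ... WHERE ... without grouping."""
    result = []
    # FROM: consider every combination of one tuple from each relation.
    for combination in product(*tables):
        row = {}
        for tup in combination:
            row.update(tup)
        # WHERE: keep the combination only if the condition is TRUE.
        if where(row):
            # SELECT: place the requested attribute values in the result.
            result.append(tuple(row[attr] for attr in select))
    return result


research_staff = evaluate(
    [EMPLOYEE, DEPARTMENT],
    where=lambda r: r["Dno"] == r["Dnumber"] and r["Dname"] == "Research",
    select=["Lname", "Dname"],
)
print(research_staff)  # [('Smith', 'Research'), ('Wong', 'Research')]
```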
