
CHAPTER 1

INTRODUCTION

A large amount of research on nutrition and health is dependent on measurements of dietary intake and health outcomes. In clinical trials and epidemiological studies in particular, quantitative methods of data analysis are applied, and this includes data mining. 'Data mining' is the process of seeking knowledge or information from data. The purpose of 'mining' the data may be to investigate relationships between predictor (e.g. diet) and target (e.g. blood pressure) variables, in order to characterize those relationships or to make predictions. In some cases, there may be no particular target variable of interest, and the purpose of the mining is then to investigate any potential relationships within the data. In industry, particularly amongst large companies, these methods are used for decision making to improve efficiency and profit. In the same way, data mining methods can be used in nutrition and dietetics to improve the efficiency of service delivery and support better patient outcomes.

In research on nutrition and health, one of the most important applications of data mining is
to use existing data to help build the evidence base for effective nutritional care. In addition,
data mining in clinical research can help improve trial protocols and health service delivery
models by examining factors such as attrition rates and subgroup responses. Unlike more
traditional methods of investigating associative relationships (such as linear regression), data mining methods have less need for hypotheses to be specified a priori and can therefore be regarded as exploratory in nature.

1.1 Dietary assessment

Dietary assessment was based on a semi-quantitative food-frequency questionnaire (FFQ) consisting of 154 food items. The dietary questionnaires were administered to all the students of each class during school hours, from February to June of 2005, by the same person, a qualified nutritionist, according to a written protocol describing a standardized methodology for dietary data collection. Further details about the dietary assessment are provided elsewhere. For this work we have used the short FFQ, because it is based on food groups and thus the dietary patterns extracted could be more meaningful than those based on single foods. This was evident after performing statistical analysis with both the long and the short FFQ: results of analyses based on the short FFQ were more meaningful. Moreover, this short FFQ has been used in the past to extract dietary patterns and dietary indexes, and the results have been published elsewhere. Lastly, the performance characteristics of this short FFQ against the long FFQ were quite satisfactory.

1.2 Assessment of socio-demographic characteristics

Information on socio-demographic characteristics (i.e., age, gender, size of the family, living in a refugee camp (yes/no), and parental relationship) was provided by the children; however, data on some demographic characteristics that could not be provided with sufficient accuracy, such as parents' educational level, income, and occupation, were collected via a short questionnaire that was completed by the parents. Socioeconomic status (SES) was assigned based on the parents' educational level and occupation.

1.3 Assessment of physical activity

The physical activity index (PAI) was calculated based on two variables that measured the weekly frequency (0 times, 1–2 times, 3–5 times, and 6–7 times) of all running and walking activities. The daily average TV (television) viewing time was calculated based on three variables that measured the time children watched TV on a typical day, on weekends, as well as on the day before the survey. Further details about this index are provided elsewhere. Additionally, time sitting behind a computer and playing with electronic games (including weekends) was also assessed in the children. The PAI has been validated against pedometer counts in a sample of 80 children; Spearman correlations ranged from rho = 0.280 to 0.352 across weekdays and weekends.
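
As an illustration of how such a validation correlation could be computed, the sketch below uses SciPy's spearmanr to correlate PAI scores with pedometer counts. The values are invented for illustration only and do not reproduce the study data.

```python
# Minimal sketch: Spearman correlation between a physical activity index (PAI)
# and pedometer counts. The arrays below are hypothetical, not study data.
from scipy.stats import spearmanr

pai_scores = [1, 3, 2, 4, 2, 3, 1, 4, 3, 2]           # weekly frequency categories
pedometer_counts = [4200, 9800, 6100, 11500, 5900,    # average daily step counts
                    10200, 3900, 12800, 8700, 6400]

rho, p_value = spearmanr(pai_scores, pedometer_counts)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```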

In order to improve survival rates in patients with serious disease, it is essential to improve the decision-making process, leading to a better and more efficient selection of treatment strategies. Nowadays, with the large amount of information held by hospital institutions, it is possible to use data mining algorithms to improve healthcare delivery.

Many aspects that were previously unknown to healthcare professionals are now being revealed by the data generated in healthcare, improving the quality of medical procedures and treatment strategies. Healthcare facilities such as hospitals produce large amounts of heterogeneous data every day, since they draw on diverse sources, data types and formats. This heterogeneity of healthcare data creates the need for rigorous examination of the data in order to assess their quality and identify possible problems that need to be solved. Because the data are so complex, it is practically impossible to analyze them with traditional tools and methods.

This complexity calls for more sophisticated techniques that are able to manage the data and produce meaningful knowledge. In that way, healthcare service records can serve as a means of assessing service quality and patient satisfaction.

Thus, the use of data technologies like data mining (DM) has become essential in healthcare. DM is a process that refers to the extraction of useful information from vast amounts of data, and it can greatly benefit the healthcare industry by creating an environment rich in meaningful knowledge. Hence, using DM to support healthcare professionals in decision-making is crucial to ensure good healthcare delivery.

In order to provide a deeper understanding of the context and importance of this study, this
section provides the general background related to the associated research field. Thus, some
concepts like knowledge discovery in databases (KDD), machine learning (ML) and DM are
dissected, and their association with the healthcare field leads to the introduction of clinical decision support systems (CDSS).

1.4 Knowledge Discovery in Databases

Over the years, the rapid growth of digitization and computerization of processes in health institutions, as well as the large number of transactions that are performed daily in these environments, has led to the production and collection of large amounts of data. This exponential increase in the amount of data stored by hospital institutions has raised the need to transform these data into relevant and useful information for the institution, leading to more efficient decision-making processes. This urgent need to extract knowledge from the growing amount of digital data propelled the use of new computational theories and tools. This area is known as KDD.

According to Fayyad et al., the KDD process begins with an analysis of the application domain and the objectives to be accomplished, and is divided into five phases. The first step of the process is to choose the data to be mined, which can range from data samples and subsets of variables up to large masses of data. The preprocessing phase aims to eliminate noise, missing values and illegitimate values. The data transformation step depends on the search objective and the algorithm to be applied, because it defines the limitations to be imposed on the database.

Improving data quality is important for better results, thus ensuring better quality in
discovered patterns.
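
A minimal sketch of this preprocessing step is shown below, using pandas. The column names and valid ranges are hypothetical; the sketch only illustrates removing missing and illegitimate values and limiting noisy outliers.

```python
# Minimal preprocessing sketch (hypothetical columns and ranges).
import pandas as pd

df = pd.DataFrame({
    "age": [10, 11, 250, None, 12],        # 250 is an illegitimate value
    "bmi": [17.2, None, 19.5, 21.0, 16.8],
})

df = df[df["age"].between(5, 18)]                   # keep plausible ages only (drops missing/illegitimate values)
df["bmi"] = df["bmi"].fillna(df["bmi"].median())    # impute missing values
df["bmi"] = df["bmi"].clip(lower=12, upper=40)      # limit the effect of noisy outliers
print(df)
```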

Fig. 1 KDD process

After completing the previous phases, DM is applied. This is the most important phase of
the KDD process.

1.5 Data Mining

DM is the process of using machine learning techniques and statistical and mathematical
functions to automatically extract potentially useful information from data in a way that is
understandable to users. It can reveal the patterns and relationships among large amounts of
data in a single or several data sets. The knowledge achieved can adopt various forms of
representation, such as equations, trees or graphs, patterns or correlations.

There is a huge amount of data available in the information industry. These data are of no use until they are converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it.

Extraction of information is not the only process we need to perform; the overall process also involves Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are complete, the resulting information can be used in applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications −

 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Data mining is highly useful in the following domains −

 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection

The functions involved in these processes are as follows.

Classification − Classification predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are known.

Prediction − Prediction is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on the available data.

Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the available data.

Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
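
The following sketch illustrates three of these functions with scikit-learn on synthetic data. It is not tied to any data set used in this report; the model choices (a decision tree, linear regression and an isolation forest) are simply common examples of each task.

```python
# Illustrative sketch of classification, numeric prediction and outlier analysis.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # synthetic predictor variables
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)      # class labels for classification
y_value = 2.0 * X[:, 0] + rng.normal(size=200)     # numeric target for prediction

clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)    # classification
reg = LinearRegression().fit(X, y_value)                     # prediction (regression)
outliers = IsolationForest(random_state=0).fit_predict(X)    # -1 marks outliers

print(clf.predict(X[:5]), reg.predict(X[:5]), (outliers == -1).sum())
```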

Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query, which is input to the system. A data mining query is defined in terms of data mining task primitives. One important primitive is the representation for visualizing the discovered patterns, i.e., the form in which the discovered patterns are to be displayed. These representations may include rules, tables, charts, graphs, decision trees and cubes.

Data mining is not an easy task, as the algorithms used can be very complex and the data are not always available in one place; they need to be integrated from various heterogeneous data sources. These factors also create some issues, which are discussed in the following sections.

1.6 Major Issues in Data Mining

The major issues concern mining methodology and user interaction, performance, and the diversity of data types. The following diagram describes these issues.

Fig. 2 Major issues in data mining

1.7 Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive, because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.

Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not applied, the accuracy of the discovered patterns will be poor.

Pattern evaluation − The patterns discovered should be evaluated for interestingness; patterns that merely represent common knowledge or lack novelty are of little value.

1.8 Performance Issues

There can be performance-related issues such as the following −

Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the existing mining results as the database grows, without mining the data again from scratch.
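
As a rough illustration of partition-based parallel mining, the sketch below splits a data set into partitions, counts item frequencies in each partition in a process pool, and merges the partial results. The data and the number of partitions are arbitrary choices for the example.

```python
# Sketch of partitioned, parallel mining: count item frequencies per partition
# and merge the partial results. Data and partitioning are illustrative only.
from collections import Counter
from multiprocessing import Pool

def mine_partition(records):
    counts = Counter()
    for record in records:
        counts.update(record)          # count item occurrences in this partition
    return counts

if __name__ == "__main__":
    transactions = [["bread", "milk"], ["bread", "fruit"], ["milk", "fruit"],
                    ["bread", "milk", "fruit"]] * 1000
    partitions = [transactions[i::4] for i in range(4)]   # 4 roughly equal partitions

    with Pool(processes=4) as pool:
        partial_counts = pool.map(mine_partition, partitions)

    merged = sum(partial_counts, Counter())               # merge partition results
    print(merged.most_common(3))
```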

1.9 Diverse Data Types Issues

Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data are available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured; therefore, mining knowledge from them adds challenges to data mining.
DM methods can be divided into two categories: supervised and unsupervised. Supervised methods are used to predict a value and require the specification of a target attribute, whereas unsupervised methods are applied to discover the intrinsic structure, patterns, or affinities in the data. The choice of the mining technique to be applied is closely related to the mining task to be performed, as this task defines the relationship between the data, i.e., the model. DM tasks are the types of discovery to be performed in a database, that is, the information to extract. To determine which task to solve, it is important to have a good knowledge of the application domain and to know the type of information to be obtained.

Therefore, DM includes two main types of techniques: descriptive and predictive. An example of descriptive techniques is clustering, which is responsible for discovering information hidden in the data. On the other hand, examples of predictive techniques are classification and regression, which are used to derive new information from existing data.
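
The contrast between the two types of techniques can be sketched as follows with scikit-learn. The synthetic data and the specific algorithms (k-means for the descriptive case, logistic regression for the predictive case) are illustrative choices, not the methods used later in this report.

```python
# Descriptive vs. predictive techniques on synthetic data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)      # known target, used only by the predictive model

# Descriptive: discover structure without a target attribute.
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Predictive: learn to predict the target attribute from the inputs.
model = LogisticRegression().fit(X, y)
print(clusters[:10], model.predict(X[:10]))
```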

Thus, there are many applications for DM, since it is highly adaptable to distinct businesses and goals; these range from retail stores, hospitals and banks to insurance and airline companies. The knowledge acquired during the DM process can also be used to support decision-making in various settings. In medicine, for example, a correct and rapid analysis of this large volume of data during the diagnosis phase is important for the identification of pathologies.

1.10 Data mining

Data mining is a computational method that permits the extraction of patterns from large databases. We applied the data mining approach to our data in order to derive dietary habits related to children's obesity status. Rules that emerged via the data mining approach revealed the detrimental influence of increased consumption of soft drinks, delicatessen meat, sweets, fried food and junk food. For example, frequent (3–5 times/week) consumption of all these foods increases the risk of being obese by 75%, whereas in children who have a similar dietary pattern but eat fish and seafood >2 times/week the risk of obesity is reduced by 33%. In conclusion, the patterns revealed by the data mining technique refer to specific groups of children and demonstrate the effect on the risk associated with obesity status when a single dietary habit is modified. Thus, a more individualized approach when translating public health messages could be achieved.

Data mining (DataM) is also a data-driven approach looking for patterns that can aid decision making. While all the above-mentioned data-driven methods can be regarded, in the broad sense, as data mining techniques, they are treated as a distinct group of techniques by statistical software companies.

Therefore, when we use the term data mining in this work we refer to data mining in the narrow sense. Data mining refers to the extraction of hidden predictive information from large databases. It is considered a powerful technology with great potential to help people focus on the most important information in their data. However, DataM has so far rarely been used in nutritional epidemiology, while it has been used in the biomedical sciences, especially in the genetics field, where the available data banks are large and complex. DataM is a process of revealing knowledge, such as patterns, associations, changes, anomalies and significant structures, from large amounts of data stored in databases, data warehouses, or other information repositories, via the use of advanced statistical algorithms, modeling and artificial intelligence. In the field of nutritional epidemiology, where the administration of various assessment tools and often long questionnaires results in large databases, the use of such a method may help to retrieve patterns and thus to evaluate various associations. One specific area where the application of the DataM technique might be helpful is the epidemiology of obesity. As is well known, obesity is a multi-factorial disease, and even though diet obviously plays a significant role in its development, it has so far been difficult to define the degree to which certain dietary habits contribute to its progression. Thus, in this work we sought to investigate the potential usefulness of the DataM approach in revealing dietary patterns related to obesity status from the database of the CYKIDS study, as compared with a conventional statistical method, that of principal components analysis.

CHAPTER 2

LITERATURE SURVEY

As Braun & Clarke (2006) point out, coding has the potential to go on indefinitely. In addition to deciding what to code, researchers must also decide when to stop coding. Some attempt to continue sampling new participants and coding until theoretical saturation is reached.

According to Draper & Swift (2010), the concept of theoretical saturation is rooted in Grounded Theory (Glaser, 1992), where coding and data collection continue in tandem as an iterative process.

Vermeer et al. (2012) note that the concept remains contentious, and even some proponents acknowledge that there is always potential for something new to be found. Theoretical saturation can therefore more usefully be considered a matter of reaching the point where the cost of collecting and analyzing any additional data outweighs any benefits.

Lee et al. (2013) applied DM techniques in order to create a prediction process for the occurrence of postoperative complications in gastric cancer patients. They developed artificial neural networks (ANN) and compared their results with those of the traditional logistic regression (LR) approach, achieving an average correct classification rate of 84.16% with the ANN in contrast with 82.4% for LR.

Polaka et al. (2016) investigated various approaches for diagnosing gastric cancer using the original dataset and datasets with subsets of features. The best results were obtained for the dataset using attribute subsets selected with the wrapper approach. Four different models were tested: C4.5 obtained 74.7% accuracy, as did CART; the RIPPER algorithm produced an accuracy of 73.9%, while the multilayer perceptron achieved the best results with 79.6%.

Hosein Zadeh et al. (2017) used an optimized multivariate imputation by chained equations (MICE) technique to predict the chances of survival in gastric cancer patients. Three different techniques were executed: the first, which consisted of applying logistic regression, obtained 63.03% accuracy, while the second, which used a non-optimized MICE algorithm, achieved an accuracy of 66.14%. Finally, the third approach, with the optimized MICE algorithm, produced results with 72.57% accuracy.

Mohammadzadeh et al. (2018) carried out a study aimed at developing a decision model for predicting the probability of mortality in gastric cancer patients and at identifying the most important factors influencing the mortality of patients who suffer from this disease. Regarding the factors affecting mortality from gastric cancer, the determined factors were diabetes, ethnicity, tobacco use, tumor size, surgery, pathological stage, age at diagnosis, exposure to chemical weapons and alcohol consumption. The accuracy of the developed decision tree was 74%.

Danial Hooshyar, Margus Pedaste and Yeongwook Yang (2019) observe that there has been extensive research employing educational data mining (EDM) approaches to predict grades or the performance of students in a course. To do so, surprisingly, most such studies focus on students' past performance (e.g., cumulative GPA) and/or non-academic factors (e.g., gender, age) to build their predictive models, without considering students' activity data.

CHAPTER 3

REPORT ON THE PRESENT INVESTIGATION

3.1 Methodology and Methods

The reference model used was the cross-industry standard process for data mining, most commonly known as CRISP-DM.

The CRISP-DM methodology provides a structured approach to planning a DM project and is a six-phase hierarchical process, divided into the following steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Fig. 3 CRISP-DM
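
Purely as an organizational sketch, the CRISP-DM phases can be laid out as a simple pipeline of functions; the function bodies below are placeholders and do not correspond to any particular tool or to the analyses in this report.

```python
# Skeleton of a CRISP-DM-style workflow (placeholder steps, assumptions only).
def business_understanding():   return {"goal": "predict dietary quality"}
def data_understanding(goal):   return {"raw": "survey records", "goal": goal}
def data_preparation(data):     return {"clean": data["raw"]}
def modeling(prepared):         return {"model": f"trained on {prepared['clean']}"}
def evaluation(model):          return {"ok": True, "model": model}
def deployment(result):         print("deploying", result["model"])

goal = business_understanding()
data = data_understanding(goal["goal"])
prepared = data_preparation(data)
model = modeling(prepared)
result = evaluation(model)
if result["ok"]:                # CRISP-DM is iterative; a failed evaluation would loop back
    deployment(result)
```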

The analysis of dietary patterns is based on the intake of individual foods. The objective was to evaluate the usability of supervised data mining methods to predict an aspect of dietary quality based on dietary intake, using a food-based coding system and a novel meal-based coding system.

The measurement of population intakes of foods and nutrients is central to the science of
human nutrition. At present, patterns of dietary intake are studied on a food-by-food basis, given
that the base units for analysis using food composition databases are the individual food
components of every meal. Whereas the use of individual foods for the study of dietary patterns
has served us well, it remains possible that parallel analysis using the nutritional composition of
meals might increase our ability to study diet-disease patterns. The concept of analyzing food
combinations at the meal level is not entirely new. The examination of food combinations at
the meal level provides an approach to deal with the complexity and unpredictability of the
diet and aims to overcome the limitations of the study of nutrients and foods in isolation.

The present article used an entirely new approach to coding meals, the full details of which are available as "Supplemental data" in the online issue. In brief, the coding system is a 2-tiered system with first-order codes created from foods consumed together as part of meals. An aggregated second-order coding system for each meal type was then created to represent the main food constituents of each meal. Either of these coding systems can be extended as a variable beyond a single meal into meal patterns within a day or within the full survey period. Thus, the new approach allows for the first time an integrated vision of a population's diet, which poses considerable challenges for the analysis of this highly complex data. This study focuses on the analytic challenges posed by such data structures through the field of data mining.

Data mining is a process that uses a variety of data analysis tools to discover patterns and relations in data that may be used to make predictions. Supervised data mining techniques are used to model an output variable based on one or more input variables, and these models can be used to predict or forecast future cases. The present article describes and compares 2 supervised methods, artificial neural networks (ANNs) and decision trees. In the past decade, the use of artificial intelligence has been explored in almost every field of medicine (4). It has been suggested that physicians will not use any method that purports to improve diagnostic accuracy unless it is easy to use and dramatically and consistently improves their performance (5). With regard to ANNs, success in the medical field has been widely demonstrated in the diagnosis and prediction of many illnesses (eg, coronary heart disease and cancer). For example, an ANN was used to estimate the risk of acute coronary death during 10-y follow-up in the Prospective Cardiovascular Munster Study (8). Overall, that analysis suggested that use of the ANN might allow prevention of 25% of all coronary events in middle-aged men, compared with 15% with logistic regression. Decision tree algorithms have also been widely applied in the medical field. However, despite their widespread use in the medical field, thus far there have been no reports in the literature of their application to dietary data.

Potential applications of such an approach include creating models of known accuracy that can predict particular health outcomes based on dietary intake variables. The objective of this study was to explore how 2 commonly used supervised data mining techniques might be used in meal pattern analysis.

3.2 SUBJECTS AND METHODS

Food consumption data

Description of ANN and C5 decision tree models

ANNs differ from many statistical methods in that they usually have more parameters than a typical model. Many types of neural network exist. For the present study, a model using the feed-forward back-propagation architecture was used because of its popularity in the literature. For the remainder of the article this specific type of neural network will be referred to using the more generalized term ANN. Within the software SPSS Clementine, the user can prevent over-training by randomly selecting 50% of the training data and reserving the remainder to evaluate the performance of the network. By default, this information determines when to stop training and provides feedback information. To examine the robustness of the ANN models, their predictions were also tested against a training set (2/3 of the data set) and a holdout sample (1/3 of the data set), as would normally be done when models are trained to run on external data sets for real-life applications. Very similar predictions were observed between the training and holdout data sets. Therefore, the model presented here was run on the total data set incorporating all data points, because the aims of this study were solely of an exploratory nature and will not be extrapolated to other samples.

Another widely used supervised technique is the decision tree, which is a predictive model that, as its name implies, can be viewed as a tree. For many classification problems where large data sets are used and the information contained is complex, decision trees can provide a useful solution. As with neural networks, many varieties of decision trees exist. In the present study, the C5 algorithm was chosen because it can represent solutions as decision trees and as rule sets. To ensure that the rules generated were not due to a small number of records, a minimum of >1% of records per child branch was required. The pruning severity of the model was at the default level of 75, which refers to the minimum number of records in each tree branch to allow a split. A stratified 10-fold cross-validation approach was used to estimate the performance and reliability of the decision tree model. In this approach, the entire data set is divided into 10 stratified mutually exclusive subsets (folds). Each fold is used once to test the performance of the model that is generated from the combined data of the remaining 9 folds, leading to 10 independent performance estimates. Because training of both the ANN and C5 involves an iterative process, all input data were randomly partitioned into 3 subsets of equal size: an internal training set, a verification set, and a validation set.
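
A rough analogue of this evaluation scheme in open-source tools is sketched below: a feed-forward neural network evaluated against a one-third holdout sample, and a decision tree assessed with stratified 10-fold cross-validation. It uses scikit-learn's MLPClassifier and DecisionTreeClassifier as stand-ins for Clementine's ANN and C5 implementations, and the flag-field data are synthetic.

```python
# Sketch of the evaluation scheme described above, using scikit-learn stand-ins
# for the Clementine ANN and C5 models, on synthetic flag-field data.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(600, 51))          # 51 food-group flag fields
y = (X[:, :5].sum(axis=1) > 2).astype(int)      # synthetic "Q1 vs Q5" style target

# Feed-forward network with a 2/3 training set and 1/3 holdout sample.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=2)
ann = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=2)
ann.fit(X_train, y_train)
print("ANN holdout accuracy:", ann.score(X_hold, y_hold))

# Decision tree with stratified 10-fold cross-validation.
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=2)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(tree, X, y, cv=cv)
print("Tree 10-fold accuracy:", scores.mean())
```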

3.3 Data used in the ANN and C5 Models

So-called first-order and second-order meal codes were used as input variables in the ANN and C5 models. A detailed description of how these codes were created is provided as "Supplemental data" in the online issue. In brief, the data set was aggregated from the original food code level to a new meal code level based on the 62 food groups. A newly formed first-order code was created to represent each meal, and this consisted of a comma-separated list of all food groups that were consumed at that eating occasion. Because the objective of this research was to go beyond simple meal descriptions to meal pattern analysis, the data were then simplified and given a more homogenous structure. Second-order meal codes were created for 4 meal types (breakfast, light meal, main meal, and snacks), and each code represents the main food components of the meal.

An important feature of these input variables is that they represent the combination of either foods consumed together as meals or meals consumed together as part of the weekly intake. Although the analysis was carried out for all 4 meal types, data are only presented here for the breakfast and main meal types to explore the predictive power of the ANN and C5 models. For the present analysis, beverages were excluded from the first- and second-order codes because they were found to dominate all patterns of food combinations at the meal level. Sugars/sweeteners were also removed because these were mainly consumed with beverages, and nutritional supplements were removed because these were not of interest in the present study. Therefore, 51 food groups formed the input variables. For the first-order codes, the presence or absence of each food group per meal is represented by a flag field, and it is the flag fields per food group per meal that are used as input variables for the "breakfast foods" and "main meal foods" data sets. For meals (based on the second-order codes), the codes are joined together into one comma-separated code to represent all breakfast meals, for example, those consumed by each respondent over the 7 d. Again, flag fields are used, and it is the flag fields per second-order meal code that are used as input variables for the "breakfast meals" and "main meals" data sets.
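
The conversion from comma-separated meal codes to flag fields might look like the following pandas sketch; the codes shown are invented examples, not actual first- or second-order codes from the survey.

```python
# Sketch: turn comma-separated meal codes into 1/0 flag fields (invented codes).
import pandas as pd

breakfasts = pd.DataFrame({
    "subject": [1, 2, 3],
    "first_order_code": ["bread,butter,tea",
                         "breakfast_cereal,low_fat_milk,fruit_juice",
                         "bread,breakfast_cereal,whole_milk"],
})

# str.get_dummies creates one flag column per food group present in the code.
flags = breakfasts["first_order_code"].str.get_dummies(sep=",")
flags.insert(0, "subject", breakfasts["subject"])
print(flags)
```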

Fig. 4 Example of the first section of the decision tree generated by the C5 algorithm used in predicting quintile 1 or 5 of the Healthy Eating Index (HEI) based on breakfast food intake. For this decision tree, inputs into the C5 algorithm were each subject's first-order codes related to their breakfast meal intakes for the 7 d. Flag fields (1/0 or yes/no) represent the presence or absence of each food group per first-order code, and these are associated with the quintile of the HEI related to each subject. The decision tree then splits based on associations between the presence or absence of each food group and how this predicts the quintile of the HEI.

Four data sets based on aggregated data from the NSIFCS main data set were used in this analysis: breakfast foods and main meal foods (based on first-order codes) and breakfast meals and main meals (based on second-order codes). The prediction variables were quintiles (Q) 1 and 5 of a dietary score, the Healthy Eating Index (HEI). The HEI variable used in this analysis was based on a previous HEI variable created in the NSIFCS (18) and was calculated according to achievement of the 6 following recommendations: carbohydrate intakes ≥47% of total energy (19), total fat intakes ≤33% and saturated fat intakes ≤10% of total energy (20), dietary fiber intakes of 30 g/d (21), fruit and vegetable intakes of ≥400 g/d (22), and free sugar (nonmilk extrinsic) intakes of ≤4 occasions/d (22). A high HEI score (Q5) refers to a respondent meeting most of the recommendations, and a low score (Q1) refers to a respondent having low compliance with most recommendations. The exploratory data sets were limited to information on the food or meal intakes of subjects who belonged to Q1 or Q5 of the HEI. In effect, the present study used a standard a priori score-based approach to dietary pattern analysis of the extreme quintiles of the HEI to provide some level of validation of the proposed methods using data mining techniques.
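
A simplified sketch of how such a score and its quintiles could be derived is given below. The cut-offs follow the recommendations as reconstructed above (the ≥/≤ symbols were garbled in the source, and the fiber target is assumed to be a minimum), and the nutrient columns and simulated values are hypothetical.

```python
# Sketch: count recommendations met and assign HEI-style quintiles (hypothetical data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500
intake = pd.DataFrame({
    "carb_pct_energy": rng.normal(48, 6, n),
    "fat_pct_energy": rng.normal(34, 5, n),
    "satfat_pct_energy": rng.normal(13, 3, n),
    "fibre_g_per_day": rng.normal(20, 7, n),
    "fruit_veg_g_per_day": rng.normal(350, 120, n),
    "free_sugar_occasions_per_day": rng.poisson(4, n),
})

score = (
    (intake["carb_pct_energy"] >= 47).astype(int)
    + (intake["fat_pct_energy"] <= 33).astype(int)
    + (intake["satfat_pct_energy"] <= 10).astype(int)
    + (intake["fibre_g_per_day"] >= 30).astype(int)          # fibre target assumed as a minimum
    + (intake["fruit_veg_g_per_day"] >= 400).astype(int)
    + (intake["free_sugar_occasions_per_day"] <= 4).astype(int)
)
quintile = pd.qcut(score.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
extremes = intake[(quintile == 1) | (quintile == 5)]          # Q1 and Q5 subsets
print(score.value_counts().sort_index(), extremes.shape)
```
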
A major criticism of ANNs is that they are a "black box" approach: they are not very transparent and have limited ability to explicitly identify possible causal relations. One approach to address this is to examine the inputs used in the network. Sensitivity analysis was therefore derived from the ANN to measure and generate insight into the relative importance of each of the inputs to the network. These importance values range from 0.0 to 1.0, where 0.0 indicates "unimportant" and 1.0 indicates "extremely important." Another approach to understanding the rules used to make the predictions is to run the ANN model through a separate C5 decision tree and generate a rule set of the predictions made by the ANN model. In this study, the C5 algorithm was used to generate rules for the ANN prediction of Q1 or Q5 of the HEI for each of the 4 data sets. Percentage accuracy in relation to predictions of Q1 and Q5 was calculated for all models.
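
Outside of Clementine, a comparable analysis could be approximated as in the sketch below: permutation importance stands in for the sensitivity analysis, and a decision tree fitted to the ANN's own predictions yields a readable rule set. The data are synthetic and the tools (scikit-learn) differ from those used in the study.

```python
# Sketch: approximate sensitivity analysis and rule extraction for an ANN.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(400, 10)).astype(float)   # 10 hypothetical flag fields
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)      # synthetic Q1/Q5-style target

ann = MLPClassifier(hidden_layer_sizes=(15,), max_iter=1000, random_state=4).fit(X, y)

# Relative importance of each input to the network (stand-in for sensitivity analysis).
importances = permutation_importance(ann, X, y, n_repeats=10, random_state=4)
print(np.round(importances.importances_mean, 3))

# Fit a small tree to the ANN's predictions to express them as readable rules.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=4)
surrogate.fit(X, ann.predict(X))
print(export_text(surrogate, feature_names=[f"food_{i}" for i in range(10)]))
```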

Fig. 5 Graphic representation of the basic architecture of the artificial neural networks (ANNs) used in the present study in relation to the first-order (A) and second-order (B) codes per subject. The first-order codes were separated out per subject per meal type. The second-order codes are shown for the breakfast meal only per subject. On the basis of the presence or absence of food groups within each first-order code per subject, the ANN attempts to predict the output variable after training on a partitioned section of the data set. This training creates estimated correction weights in the hidden layer. The output layer consists of 2 response variables: quintiles 1 and 5 of the Healthy Eating Index (HEI).
The most frequent food groups consumed by those in Q1 (lowest quintile) of the HEI were butter, whole milk, white bread, and breakfast cereal, whereas the most frequent food groups consumed by those in Q5 (highest quintile) were breakfast cereals, fruit juice, low-fat milk, preserves, whole-meal bread, and whole milk. In relation to the main meal, the most frequent food groups consumed by those in Q1 of the HEI were whole milk, potatoes, green vegetables and carrots, sauces and dressings, and chips, whereas the most frequent food groups consumed by those in Q5 were potatoes, green vegetables and carrots, sauces and dressings, low-fat milk, and peas and beans.

The analysis now moves from foods within first-order codes as input variables to meal codes, both at breakfast and the main meal, based on the second-order codes. These input variables relate to the frequency with which meals were consumed as part of a respondent's weekly dietary intake. Breakfast meals that were more frequently consumed by respondents in Q1 of the HEI included "No Breakfast consumed," "Bread," and "Bread & Eggs/Meat products." Breakfast meals such as "Bread & Breakfast cereal," "Bread & Fruit/Juice," "Bread, Breakfast cereal & Fruit/Juice," and "Breakfast cereal & Fruit/Juice" were more frequently observed in respondents in Q5 of the HEI. In terms of main meals, those more frequently consumed by respondents in Q1 of the HEI included "No Main meal," "Meat/Fish & products & Chips," "Meat/Fish & products," and "Chips & Fruit/Veg/Salad," whereas meals such as "Potatoes, Veg/Meat & Fruit/Veg/Salad," "Potatoes, Veg/Meat, Fruit/Veg/Salad & Confect/Snack," and "Rice/Pasta dish & Fruit/Veg/Salad" were more frequently observed in respondents in Q5 of the HEI.

The 4 models based on the 4 data sets generated by the ANN and C5 decision tree were evaluated based on the accuracy of predicting Q1 and Q5 of the HEI. Overall, the 2 techniques produced models with very similar accuracies. When individual foods per meal type were considered, the ANN had a slightly higher accuracy in its ability to correctly predict HEI Q1 and Q5 (78.7% compared with 76.9%, and 71.9% compared with 70.1%, respectively). However, when the meal-coding system was used, the C5 decision tree displayed higher accuracies in predicting the HEI quintiles (67.5% compared with 54.6%, and 75.1% compared with 72.4%, respectively). It was also possible to examine accuracy in relation to specifically predicting Q1 and Q5 separately, and these are related to the sensitivity and specificity of the models (ie, accuracy of predicting Q1 is also the sensitivity of the model, and accuracy in predicting Q5 is also the specificity of the model). The ANN was most accurate in predicting Q1 (lowest quintile), except when the breakfast meals were inputs to the model, where the accuracy was only 19.9%. Similarly, the ANN was also more accurate in predicting Q5, apart from when the main meals were used as the input variables, where the C5 model had a higher accuracy (70.8% compared with 60.9%). Agreement between the predictions of the two techniques was compared for the 4 data sets, and the highest comparability between the 2 was for the breakfast foods data set.

Table 1 – The variables used in the application of the data mining method, with their corresponding coding.

Variable                 Coding
Gender                   1: boy, 2: girl
Fried food               1, 2, 3, 4
Fish and seafood         1, 2, 3, 4
Delicatessen meat        1, 2, 3, 4
Soft drinks              1, 2, 3, 4
Sweets and junk food     1, 2, 3, 4
Obesity                  1: normal, 2: overweight/obese

3.4 Anthropometric measurements

Anthropometric data, i.e., weight, height and waist circumference (WC), were collected from the sub-sample of 634 children (valid values were 622), according to a standard protocol described by Heymsfield. Obesity was defined by the IOTF age- and sex-specific Body Mass Index (BMI) cut-off criteria. For this work we merged overweight and obese children into one group and compared them against the group of normal-weight children; thus, when using the term obese, overweight children are included.

Variables selection - As already mentioned, the variables of the FGFQ were used for this work. The following food variables were selected: fried food, grilled food, fish and seafood, meat, delicatessen meat, soft drinks, sweets and junk food, dairy, milk, yogurt, bread, cereals and grains (excluding bread), fruits and fruit juices, legumes and vegetables. In the appendix we explain which foods are included in each food group. These variables were measured as frequencies of foods consumed per week [i.e., fried food: 1: none (0 per week), 2: few (1–2 per week), 3: many (3–5 per week), and 4: 6–7 per week]. Moreover, the variable "gender" was also considered in order to see if there was any different pattern between boys and girls, and the variable "obesity status (yes/no)" was used in combination with the aforementioned variables for the creation of the patterns. Table 1 presents the variables included in the data mining procedure.

3.5 Methodology of application. Data mining: classification by decision tree induction – C4.5

The C4.5 algorithm, which is one of the common data mining techniques, was used for this work. The C4.5 algorithm arrives at the significant rules on the basis of a splitting threshold and an automatic function of the data mining software tool, called "pruning severity", which cuts all rules that are without sense. The splitting threshold can take values from 0 to 1. If the expert wants to extract as many patterns as possible, the threshold is set to 0.1; if the aim is to extract only the most important patterns, a value near 1 is given, i.e., 0.9 or 1. For the present work the threshold was set to 0.1.

The classification C4.5 algorithm was run several times (for one factor, six runs were carried out; for two factors, 15 runs were carried out; and so on, up to six factors, for which one run was carried out). Many rules (tens to hundreds) were extracted. The rules extracted from the different runs were similar. These rules were then stored in a worksheet. Duplicated and meaningless rules were not considered; exclusion of the non-meaningful rules was performed by an expert of the field (a health professional). In each run the rules were built from the same variables except one, so that the importance of the particular variable excluded each time could be assessed. Rules were extracted from different combinations of variables, with a minimum of one to a maximum of six variables appearing in the different rules. The rules derived from the different sub-sets of this study's dataset were similar, and were further compared and analyzed in relation to children's obesity status. A similar methodology was applied elsewhere.
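
The sketch below mimics this procedure with open-source stand-ins: a CART decision tree (scikit-learn does not implement C4.5 or its pruning-severity setting) is fitted on every subset of the selected variables, and the resulting trees are exported as rules for expert review. The variable names, the synthetic obesity label and the data are hypothetical.

```python
# Sketch: fit a tree on every subset of food variables and collect the rules
# for expert review (CART used as a stand-in for C4.5; data are hypothetical).
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
foods = ["fried_food", "fish_seafood", "delicatessen_meat", "soft_drinks",
         "sweets_junk_food", "gender"]
data = pd.DataFrame(rng.integers(1, 5, size=(600, 5)), columns=foods[:5])
data["gender"] = rng.integers(1, 3, size=600)                   # 1: boy, 2: girl
obesity = (data["fried_food"] + data["soft_drinks"] > 5).astype(int)  # synthetic label

rule_sets = []
for k in range(1, len(foods) + 1):
    for subset in combinations(foods, k):    # 6 one-variable runs, 15 two-variable runs, ...
        tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=5)
        tree.fit(data[list(subset)], obesity)
        rule_sets.append((subset, export_text(tree, feature_names=list(subset))))

print(len(rule_sets), "rule sets extracted")   # duplicated/meaningless rules are then pruned by the expert
print(rule_sets[0][1])
```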

Table 2 Factors of children's food composition (principal component loadings)

Factors: (1) Fried food, soft drinks, sweets, junk food; (2) Fruits and vegetables; (3) Dairy products and protein foods; (4) Milk, bread and cereals

Frequency of weekly consumption of fried food: 0.727
Frequency of weekly consumption of soft drinks/squash/fruit drinks: 0.651, 0.165, −0.258
Frequency of weekly consumption of sweets/junk food: 0.623, −0.102, 0.186
Frequency of weekly consumption of meat: 0.497, 0.373
Frequency of weekly consumption of grilled food: 0.496, 0.375, −0.208
Frequency of weekly consumption of salted/smoked meat: 0.396, 0.251, 0.385
Frequency of weekly consumption of fruit/fresh fruit juice: 0.741, 0.135
Frequency of weekly consumption of vegetables/fresh vegetable juice: 0.723, 0.154, 0.103
Frequency of weekly consumption of legumes: −0.196, 0.390, 0.377
Frequency of weekly consumption of dairy products: 0.643, 0.246
Frequency of weekly consumption of yogurt: 0.190, 0.527, 0.107
Frequency of weekly consumption of fish/seafood: 0.284, 0.391, 0.459, −0.317
Frequency of weekly consumption of milk: 0.116, 0.637
Frequency of weekly consumption of bread: 0.122, 0.284, 0.580
Frequency of weekly consumption of cereals/grains: 0.129, 0.309, 0.514

3.6 Multivariate statistical analyses


Continuous variables are presented as mean ± SD, whereas categorical variables are presented as absolute and relative frequencies. Normality of distribution was tested by the Kolmogorov–Smirnov test. Associations between normally distributed variables were tested by Student's t-test, whereas the Mann–Whitney U test was used for non-normally distributed continuous variables. Associations between categorical variables were tested by contingency tables and the chi-square test without Yates' continuity correction in 2 × 2 tables. Principal component analysis (PCA) with varimax rotation was applied to the short FGFQ to extract the main factors of diet composition from the 15 variables assessing children's frequency of consumption of food groups. The correlation matrix was used for the extraction of the components. The solution was rotated in order to increase the representation of each food or food group on a component. Based on the principle that component scores are interpreted similarly to correlation coefficients (higher absolute values indicate that the food variable contributes most to the construction of the component), the food patterns were named according to the foods that correlated most with the factor (scores > 0.35). In the present analysis 4 components were extracted based on their eigenvalues (≥1). Afterwards, the extracted factors were used as independent variables in logistic regression models run separately for each gender to determine the association of the children's diet composition with their obesity status. Results would thus be comparable with those of the data mining procedure, which did not reveal any significant rules relating obesity status to other socio-demographic factors or to the physical activity variables, and which presented association rules by gender. Finally, the logistic regression analyses were repeated fully adjusting for the potential confounders of age, place of residence (urban/rural), parental BMI class, SES level, PAI and sedentary behaviors (as assessed by TV viewing time).
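
A sketch of this analysis pipeline is given below: PCA on standardized frequency variables (equivalent to using the correlation matrix), a small varimax rotation implemented directly in NumPy, and a logistic regression of obesity status on the component scores. The data are simulated; only the overall flow mirrors the description above, not the study's results.

```python
# Sketch: PCA with varimax rotation followed by logistic regression (simulated data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a loadings matrix with the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    variance = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
        rotation = u @ vt
        if s.sum() < variance * (1 + tol):
            break
        variance = s.sum()
    return loadings @ rotation

rng = np.random.default_rng(6)
freq = rng.integers(1, 5, size=(600, 15)).astype(float)   # 15 food-frequency variables
obese = rng.integers(0, 2, size=600)                      # simulated obesity status

z = StandardScaler().fit_transform(freq)                  # standardize: correlation-matrix PCA
pca = PCA(n_components=4).fit(z)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
rotated_loadings = varimax(loadings)
scores = z @ rotated_loadings                             # component scores (unscaled)

model = LogisticRegression().fit(scores, obese)           # components as independent variables
print(np.round(rotated_loadings[:5], 2), model.coef_.round(2))
```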

All reported p values are based on two-tailed tests and compared with a significance level of 5%. SPSS 13.0 software (Statistical Package for Social Sciences, Chicago, IL, USA) was used for all statistical calculations.

CHAPTER 4

RESULTS AND DISCUSSION

RESULTS
The results demonstrate that both methods can be used to develop models that possess a high
degree of predictive accuracy based on dietary intake data. Overall, the first-order
breakfast foods offered a better predictive accuracy for the HEI than did the second-order
breakfast meal codes. This was highlighted with a predictive accuracy of only 19.9% by the
ANN in predicting Q5 of the HEI when breakfast meals were used as inputs.

DISCUSSION

The use of supervised artificial intelligence methods, such as ANNs and decision trees, in the medical field is a newly emerging phenomenon. However, to our knowledge, no studies have yet applied these analytic tools in the realm of nutrition. In this discipline there are prevailing gaps between studies of the foods consumed, as captured by nutritional surveillance surveys (on which there is a large amount of knowledge), and methods for analyzing patterns of food combinations and how such combinations influence overall dietary quality (on which there is limited knowledge). Therefore, the work presented herein attempts to reduce this gap and illustrates the potential usefulness of artificial intelligence, particularly ANNs and decision trees, in predicting an aspect of diet quality based on food intake at the meal level.

CHAPTER 5

CONCLUSION AND FUTURE WORK

The present study has shown that both ANNs and C5 decision trees can be used to predict an aspect of dietary quality, the HEI, using foods consumed together at meals or using meals consumed together over the survey period. Both data mining techniques had similar accuracies in relation to their prediction of HEI quintiles; with both approaches, similar foods and meals were found to be important predictors. The sophistication of these techniques in providing a means to predict a dietary variable such as the HEI with a considerable degree of accuracy offers a feasible alternative approach to more traditional methods. However, further exploration of the use of ANNs and decision trees in nutritional science is warranted to appreciate the potential value of this statistical approach in predicting dietary behavior.

REFERENCES

 Draper, A. & Swift, J.A. (2010) Qualitative research in nutrition and dietetics: data collection issues. J. Hum. Nutr. Diet., in press.
 Fade, S. (2004) Using interpretative phenomenological analysis for public health nutrition and dietetic research: a practical guide. Proc. Nutr. Soc. 63, 647–653.
 Gibbs, G. (2007) Analyzing Qualitative Data: Sage Qualitative Research Kit. London:
Sage Publications.
 Glaser, B. (1992) Basics of Grounded Theory Analysis. Mill Valley, CA: Sociology
Press.
 Glaser, B.G. & Strauss, A.L. (1967) The Discovery of Grounded Theory. New York, NY:
Aldine.
 Gough, B. & Conner, M.T. (2006) Barriers to healthy eating amongst men: a qualitative analysis. Soc. Sci. Med. 62, 387–395.
 Green, J. & Thorogood, N. (2004) Qualitative Methods for Health Research. London:
Sage Publications.
 Lewins, A. & Silver, C. (2007) Using Software in Qualitative Research: A Step by Step
Guide. London: Sage Publications.
 Mason, J. (1996) Qualitative Researching, 1st edn. London: Sage Publications.
 Miles, M. & Huberman, M. (1984) Qualitative Data Analysis: A Source Book for New
Methods. Thousand Oaks, CA: Sage Publications.
 Morse, J. (2008) Confusing categories and themes. Qual. Health Res. 18, 727–728.
 Popay, J., Rogers, A. & Williams, G. (1998) Rationale and standards for the systematic review of qualitative literature in health services research. Qual. Health Res. 8, 341–351.
 Pope, C., Ziebland, S. & Mays, N. (2000) Qualitative research in healthcare: analysing
qualitative data. Br. Med. J. 320, 114–116.
 Riessman, C.K. (2004) Narrative analysis. In The Sage Encyclopaedia of Social Science
Research Methods (Vols. 1–3).
 Robson, C. (1993) Real World Research: A Resource for Social Scientists and
Practitioner Researchers, 1st edn. Oxford: Blackwell Publishing.

 Silverman, D. (1997) Qualitative Research. Thousand Oaks, CA: Sage Publications.
 Smith, J.A., Flowers, P. & Larkin, M. (2009) Interpretative Phenomenological Analysis.
Theory, Method and Research. London: Sage Publications.
 Strauss, A. (1987) Qualitative Analysis for Social Scientists. Cambridge: Cambridge
University Press.
 Strauss, A. & Corbin, J. (1998) Basics of Qualitative Research. Techniques and
Procedures for Developing Grounded Theory. Thousand Oaks, CA: Sage Publications.
