CASP-DM
Abstract
We propose an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) which addresses the specific challenges that context handling and model reuse pose for machine learning and data mining. This new, general context-aware process model is mapped onto the CRISP-DM reference model, proposing some new or enhanced tasks and outputs.
Keywords: data mining, reframing, context awareness, process model, methodology.
1 Introduction
Anticipating potential changes in context is a critically important part of data mining projects. Un-
foreseen context changes can lead to substantial additional costs and in the extreme case require
running a new project from scratch. For example, an automatic text summarisation system de-
veloped in the context of the English language can be extremely hard to be modified for other
languages, unless such context change is anticipated. For another example, a fraud detection ser-
vice provider develops its detectors in the context of known types of frauds, but the context keeps
changing, with new types invented continuously. A careful analysis can help to build more versatile
detectors which are effective for some new types of frauds and are easy to update for other new
types. As a third example, a customer segmentation system helping to tailor products for different
customer groups might be hard to modify to incorporate richer customer information, unless such
context changes are anticipated.
Figure 1: Main methodology used for data mining, according to the KDnuggets polls held in 2002, 2004, 2007 and 2014 (percentage of respondents per option: CRISP-DM, My own, Other, SEMMA, My organization's, None).
Context anticipation is more than just a single separate task and it requires dedicated activities
in all phases of the data mining process, from the initial domain understanding up to the final de-
ployment. These activities are not included in any of the existing Data Mining (DM) standard process
methodologies, such as the Knowledge Discovery in Databases (KDD) Process Fayyad et al. (1996a),
the Cross Industry Standard Process for Data Mining (CRISP-DM) Chapman et al. (2000) and the
Sample, Explore, Modify, Model and Assess (SEMMA) SAS (2005) process model. In this paper, we
report on an extension of the CRISP-DM process model called CASP-DM (Context-Aware Standard
Process for Data Mining), which has been evolving as a new standard with the goal of integrat-
ing context-awareness and context changes in the knowledge discovery process, while remaining
backward compatible, so that users of CRISP-DM can adopt CASP-DM easily.
There are multiple reasons why we have used CRISP-DM as a base. CRISP-DM is the most complete
data mining methodology in terms of meeting the needs of industrial projects and has become the
most widely used process for DM projects, according to the KDnuggets polls held in 2002, 2004,
2007, and 2014. Although CRISP-DM does not seem to be maintained (the original crisp-dm.org site is no longer active) or adapted to the new
challenges in data mining, the proposed six phases and their subphases are still a good guide for
the knowledge discovery process. In fact, the interest in CRISP-DM continues to be high compared
to other models (see Figures 1 and 2). The participation and cooperation of the data mining community is, of course, pivotal to the success of CASP-DM. This requires a platform where the data mining community can access the standard, which otherwise risks being diluted, and which can serve as an embryo for a committee and stable working group maintaining an evolving standard that accommodates future challenges in the field. Furthermore, CRISP-DM is supported by several project management software tools, such as RapidMiner and IBM SPSS Modeler. Extending CRISP-DM into CASP-DM allows data mining projects to become context-aware while continuing to use these tools.
Figure 2: Relative interest over time in web searches according to Google Trends (www.google.es/
trends/). Terms legend: CRISP-DM in blue, KDD in red, SEMMA in green (the latter having a relative
interest close to zero).
The rest of the document is organised as follows. Section 2 briefly reviews CRISP-DM and re-
lated methodologies, and the state of the art in terms of standardisation and maintenance of the
methodology. Section 3 discusses the role that context (or domain) plays in DM applications and
the main types of context and context changes (including changes in costs, data distribution and
others). Section 4 proposes CASP-DM, with new tasks and outputs as well as enhancements to the
original reference model, thus allowing practitioners to be aware of (and anticipate) the main types of context. Finally, Section 5 closes the paper.
Figure 3: Evolution of DM Methodologies. Adapted from (Mariscal et al., 2010)
• Developing an understanding of the application domain, the relevant prior knowledge and
the goals of the end-user.
• Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
• Data cleaning and preprocessing: including basic operations for removing noise or outliers,
collecting necessary information to model or account for noise, deciding on strategies for han-
dling missing data fields, and accounting for time sequence information and known changes.
• Data reduction and projection: including finding useful features to represent the data de-
pending on the goal of the task, using dimensionality reduction or transformation methods to
reduce the effective number of variables under consideration or to find invariant representa-
tions for the data.
• Choosing the data mining task: deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
• Choosing the data mining algorithm(s): selecting method(s) to be used for searching for
patterns in the data, deciding which models and parameters may be appropriate and matching
a particular data mining method with the overall criteria of the KDD process.
• Data mining: searching for patterns of interest in a particular representational form or a set
of such representations as classification rules or trees, regression, clustering, and so forth.
• Consolidating discovered knowledge: incorporating the discovered knowledge into the per-
formance systems.
The different phases of the KDD process are outlined in Figure 4, where we see a large number of unnecessary loops between steps and a lack of business guidance.
Figure 4: An Overview of the steps of the KDD Process (from Fayyad et al. (1996a))
Several other process models and methodologies have been developed using the KDD approaches
as a basis. The Human-Centered Approach to Data Mining is presented in (Brachman and Anand,
1996; Gertosio and Dussauchoy, 2004). This proposal involves a holistic understanding of the
entire Knowledge Discovery Process and involves eight steps: human resource identification, problem specification, data prospecting, domain knowledge elicitation, methodology identification, data preprocessing, pattern discovery, and knowledge post-processing. It considers people's involvement and interpretation in each step and emphasises that the target user is the data engineer.
SEMMA (SAS, 2005), which stands for Sample, Explore, Modify, Model and Assess, is the methodology that the SAS Institute proposed for developing DM products. Although it is a methodology, it covers only the technical part of the project and is integrated into SAS tools such as Enterprise Miner. Unlike the former KDD process, SEMMA is not an open process and can only be used in these tools. The steps of SEMMA are mainly focused on the modeling tasks of DM projects, leaving out the business aspects. The steps are the following: sample, explore, modify, model and assess.
The two models by (Cabena et al., 1998) and (Anand and Büchner, 1998; Anand et al., 1998; Buchner et al., 1999) are based on the KDD process, with no big differences and similar features. The former structures the process in a different number of steps (business objectives determination, selection, preprocessing and transformation, data mining, analysis of results and assimilation of knowledge) and was mostly used in the marketing and sales domain, being one of the first process models that took the business objectives into account. For its part, the latter process model is adapted to web mining projects and focused on an online customer (incorporating the available operational and materialized data as well as marketing knowledge). The model consists of eight steps: human resource identification, problem specification, data prospecting, domain knowledge elicitation, methodology identification, data preprocessing, pattern discovery, and knowledge post-processing. Although it provides a detailed analysis for the initial steps, it does not include information on using the obtained knowledge.
The Two Crows process model Edelstein (1998), proposed by Two Crows Consulting, takes advantage of some insights from (early, pre-release versions of) CRISP-DM. It proposes a non-linear list of steps (very close to the KDD phases), so it is necessary to go back and forth between them. The basic steps of data mining for knowledge discovery are: define business problem, build data mining database, explore data, prepare data for modeling, build model, evaluate model, and deploy model and results.
Model | Domain | # Phases | Phases
KDD | Academic | 5 | Selection; Preprocessing; Transformation; Data Mining; Interpretation/Evaluation
KDD Fayyad | Academic | 9 | Developing an Understanding of the Application Domain; Creating a Target Data Set; Data Cleaning and Preprocessing; Data Reduction and Projection; Choosing the DM Task; Choosing the DM Algorithm; Data Mining; Interpreting Mined Patterns; Consolidating Discovered Knowledge
5 A's | Industry | 5 | Assess; Access; Analyse; Act; Automate
Another industry methodology is Six Sigma (Harry, 1998), based on the DMAIC cycle (Define, Measure, Analyze, Improve, and Control). This methodology has proven to be successful in companies such as IBM, Microsoft, General Electric, Texas Instruments or Ford.
KDD Roadmap (Debuse et al., 2001) is an iterative data mining methodology used in the Witness Miner toolkit, which uses a visual stream-based interface to represent routes through the KDD roadmap (consisting of eight steps: problem specification, resourcing, data cleansing, preprocessing, data mining, evaluation, interpretation and exploitation). The main contribution of the KDD roadmap is the resourcing task, which consists of the integration of databases from multiple sources to form the operational database.
Figure 6: Process diagram showing the relationship between the different phases of CRISP-DM
1. Business understanding: This initial phase focuses on understanding the project objectives
and requirements from a business perspective, then converting this knowledge into a data
mining problem definition and a preliminary plan designed to achieve the objectives.
2. Data understanding: The data understanding phase starts with an initial data collection and
proceeds with activities in order to get familiar with the data, to identify data quality problems,
to discover first insights into the data or to detect interesting subsets to form hypotheses for
hidden information.
3. Data preparation: The data preparation phase covers all activities to construct the final dataset
from the initial raw data. Data preparation tasks are likely to be performed multiple times and
not in any prescribed order. Tasks include table, record and attribute selection as well as
transformation and cleaning of data for modeling tools.
4. Modeling: In this phase, various modeling techniques are selected and applied and their pa-
rameters are calibrated to optimal values. Typically, there are several techniques for the same
data mining problem type. Some techniques have specific requirements on the form of data.
Therefore, stepping back to the data preparation phase is often necessary.
5. Evaluation: At this stage the model (or models) obtained are more thoroughly evaluated and
the steps executed to construct the model are reviewed to be certain it properly achieves the
business objectives. A key objective is to determine if there is some important business issue
that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
6. Deployment: Creation of the model is generally not the end of the project. Even if the pur-
pose of the model is to increase knowledge of the data, the knowledge gained will need to be
organised and presented in a way that the customer can use it.
Its final goal is to make the process repeatable, manageable and measurable (i.e., able to yield metrics). CRISP-DM is usually referred to as a methodology, albeit an informal one (it does not provide a rigid framework, task/input/output specification and execution, evaluation metrics, or correctness criteria), because it provides the most complete tool set for DM practitioners. The current version includes the reference process model and an implementation user guide defining the phases, tasks, activities and deliverable outputs of these tasks.
It is clear from Figure 3 that CRISP-DM is the standard model: it has borrowed principles and ideas from the most important models (KDD, SEMMA, Two Crows, . . . ) and has been the source for many later proposals. However, many changes have occurred in the business application of data mining since the first version of CRISP-DM was published: new data types and data mining techniques and approaches, more demanding requirements for scalability, real-time deployment and large-scale databases, etc. The CRISP-DM 2.0 Special Interest Group (SIG) was established with the aim of meeting the changing needs of DM with an improved version of the CRISP-DM process. This version was expected to appear in 2007, but the effort was eventually discontinued.
However, other process models based on the original CRISP-DM approach have appeared. Cios
et al.’s six-step discovery process (Cios et al., 2000; Cios and Kurgan, 2005) was first proposed in
2000 adapting the CRISP-DM model to the needs of the academic research community. The main
extensions include, among others, improved (research-oriented) description of the steps, explicit
feedback mechanisms, reuse of knowledge discovered between different domains, etc. The model
consists of six steps: understanding the problem domain, understanding the data, preparation of
the data, data mining, evaluation of the discovered knowledge and using the discovered knowledge.
The RAMSYS (RApid collaborative data Mining SYStem) (Moyle and Jorge, 2001) is a methodol-
ogy for developing DM and KD projects where several geographically diverse groups (nodes) work
together on the same problem in a collaborative way. This methodology, although based on CRISP-
DM (same phases and generic tasks), emphasises collaborative work, knowledge sharing and com-
munication between groups. Apart from the original CRISP-DM tasks, the RAMSYS methodology
proposes a new task called model submission (modeling step), where the best models from each
of the nodes are evaluated and delivered.
Finally, in 2015, IBM Corporation released ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics), a new methodology which refines and extends CRISP-DM. ASUM-DM retained the “Analytical” activities and tasks of CRISP-DM, but the method was augmented by adding infrastructure, operations, deployment and project management sections as well as templates and guidelines.
Context change | Examples of parametrised context
Distribution shift (covariate, prior probability, concept) | Input or output variable distribution
Costs and evaluation function | Cost proportion, cost matrix, loss function
Data quality (uncertain, missing, or noisy information) | Noise or uncertainty degree, missing attribute set
Representation change, constraints, background knowledge | Granularity level, complex aggregates, attribute set
Task change | Binarised regression cut-off, bins
from noisy data Angluin and Laird (1988); Frénay and Verleysen (2013), context-aware comput-
ing Abowd et al. (1999), mimetic models Blanco-Vega et al. (2006), theory revision Richards and
Mooney (1991), lifelong learning Thrun and Pratt (2012) and incremental learning Khreich et al.
(2012). Generally, in these areas the context change is analysed when it happens, rather than be-
ing anticipated, thus learning a model in the new context and reusing knowledge from the original
context.
A more proactive way to deal with context changes is by constructing a versatile model, which
has the distinct advantage that it is not fitted to a particular context or context change, and thus en-
ables model reuse. A new and generalised machine learning approach called Reframing Hernández-
Orallo et al. (2016) addresses that. It formalises the expected context changes before any learning
takes place, parametrises the space of contexts, analyses its distribution and creates versatile mod-
els that can systematically deal with that distribution of context changes. Therefore, the versatile
model is reframed using the particular context information for each deployment situation, and not
retrained or revised whenever the operating contexts change (see Figure 7). Rather than being an
umbrella term for the above-mentioned related areas, reframing is a distinctive way of addressing
context changes by anticipating them from the outset. Cost-sensitive learning Elkan (2001); Turney
(2000); Chow (1970); Tortorella (2005); Pietraszek (2007); Vanderlooy et al. (2006) and ROC anal-
ysis and cost plots Metz (1978); Flach et al. (2003); Fawcett (2006); Flach (2010); Drummond and
Holte (2006); Flach et al. (2011); Hernández-Orallo et al. (2011); Hernández-Orallo et al. (2012a);
Hernández-Orallo et al. (2013) can be seen as areas where reframing has been commonly used in
the past, and generally restricted to binary classification.
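To make this concrete, the following minimal sketch illustrates the best-known instance of reframing mentioned above: cost-sensitive threshold choice Elkan (2001). A single probabilistic classifier is trained once (the versatile model) and only its decision threshold is reframed for each cost context; the dataset, model choice and cost values are illustrative assumptions rather than part of any CASP-DM prescription.

# Minimal sketch: output reframing for cost contexts (illustrative only).
# One probabilistic classifier is trained once; for each deployment context
# only the decision threshold is reframed (Elkan, 2001), with no retraining.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # versatile model
scores = model.predict_proba(X_te)[:, 1]                   # trained once

# Each deployment context is parametrised by its misclassification costs.
contexts = {"context A": (1.0, 1.0),   # (cost of FP, cost of FN)
            "context B": (1.0, 10.0),  # false negatives 10x more costly
            "context C": (5.0, 1.0)}   # false positives 5x more costly

for name, (c_fp, c_fn) in contexts.items():
    threshold = c_fp / (c_fp + c_fn)            # cost-optimal threshold
    y_pred = (scores >= threshold).astype(int)  # reframed decisions
    cost = (c_fp * np.sum((y_pred == 1) & (y_te == 0))
            + c_fn * np.sum((y_pred == 0) & (y_te == 1)))
    print(f"{name}: threshold={threshold:.2f}, total cost={cost:.0f}")

The same scorer thus serves every context; only the cheap reframing step (here, one division) is repeated per deployment.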
Generally speaking, the process of preparing a model to perform well over a range of different
operation contexts involves a number of challenges:
• Reuse of learnt knowledge: Models are required to be more general and adaptable to changes
in the data distribution, data representation, associated costs, noise, reliability, background
knowledge, etc. This naturally leads to a perspective in which models are not continuously
retrained and re-assessed every time a change happens, but rather kept, enriched and vali-
dated in a long-term model life-cycle. This leads us to the concept of versatile models, able to
generalise over a range of contexts.
• Variety of contexts and context changes: The process of preparing and devising a versatile
model to perform well over a range of operating contexts (beyond the specific context in
which the model was trained) involves dealing with a number of different possible context
changes that are commonly observed in machine learning applications: distribution shift Kull
and Flach (2014); Moreno-Torres et al. (2012); Quiñonero-Candela et al. (2009), cost and
evaluation function Elkan (2001); Turney (2000); Chow (1970); Pietraszek (2007); Tortorella
(2005); Vanderlooy et al. (2006), data quality Frénay and Verleysen (2013), representation
change Martínez-Usó and Hernández-Orallo (2015); Martínez-Usó et al. (2015), constraints,
background knowledge, task change Scheirer et al. (2013); Hernández-Orallo et al. (2016),. . .
Figure 7: Operating contexts, models and reframing. The model on the left is intentionally more
versatile than strictly necessary for context A, in order to ease its reframing to other contexts (e.g.,
B and C) without retraining it repeatedly.
• Context-aware approaches for machine learning: Retraining vs. Revision vs. Reframing
trilemma: Retraining on the training data is very general, but there are many cases where it
is not applicable. For instance, the training data may have been lost or may not exist (e.g.,
training models that have been created or modified by human experts) or may be prohibitively
large (if deployment must work in restricted hardware), or the computational constraints do
not allow retraining for each deployment context separately. Retraining on the deployment
data can work well if there is an abundance of deployment data, but often the deployment
data are limited, unsupervised or simply non-existent. A common alternative to retraining is revision Raedt (1992); Richards and Mooney (1991), where parts of the model are patched or extended according to a new context (upon detection of novelty or inconsistency of the new data with respect to the existing model). Revision is especially natural as a result of incremental learning Khreich et al. (2012) or lifelong learning Thrun and Pratt (2012). Finally, reframing, as said above, is a context-aware approach that reuses the model trained in the training context by subjecting it to a reframing procedure that takes into account the particular deployment context.
These challenges require a change of methodology. If we have to be more anticipative with con-
text, we need a process model where context is present from the very beginning, and the analysis,
identification and use of context (changes) must be part of several stages. This is what CASP-DM
undertakes.
4 CASP-DM
CASP-DM, which stands for Context-Aware Standard Process for Data Mining, is the proposed extension of CRISP-DM for addressing the specific challenges of machine learning and data mining for context and model reuse handling. The CASP-DM model inherits flexibility and versatility from the CRISP-DM life cycle and puts more emphasis on the fact that the sequence of phases is not rigid: context changes may affect different tasks, so it should be possible to move to the appropriate phase. This is illustrated in Figures 8 (simplified) and 9 (complete), where a flow chart shows which tasks in the CASP-DM process model should be completed whenever a context change needs to be addressed.
Figure 9: Complete view of the CASP-DM tasks to be completed whenever (1) a new context-aware
DM project starts; or (2) a context change needs to be addressed.
In this section we overview the life cycle of a DM project by putting emphasis on those new and
enhanced tasks and outputs that have to do with context and model reuse handling (Figure 10).
Enhanced or new tasks/outputs are shown in dark red. Furthermore, a running example of model
reuse with bike rental station data (MoReBikeS) Kull et al. (2015b) will be used to illustrate how
CASP-DM is applied in a real environment.
Figure 10: Legend of the different representation of original and new/enhanced tasks and outputs.
Figure 11: Phase 1. Business understanding: tasks and activities for context-awareness
The CASP-DM first phase “Business understanding” (as well as the second phase “Data under-
standing”) is the phase where the data mining project is being understood, defined and conceptu-
alized. The remaining phases are implementation-related phases, which aim to resolve the tasks set in the first phases. As in the original CRISP-DM, the implementation phases are highly incremental and iterative, where the lessons learned during the process and from the deployed solutions can
benefit subsequent data mining processes.
The initial phase focuses on understanding the project objectives and requirements from a busi-
ness perspective, then converting this knowledge into a data mining problem definition and a pre-
liminary plan designed to achieve the objectives. Adapting this phase to address context changes
and model reuse handling involves: (1) adding new specialized tasks for identifying long term
reusability business goals (whether the business goals involve reusability, adaptability, and ver-
satility) w.r.t. context changes, (2) determining both data mining goals and success criteria when
we address a context-aware data mining problem (which type of context-aware technique should be
used depends on what aspects of the model are reusable in other contexts) and, finally, (3) performing an initial assessment of available context-aware techniques and updating the project plan describing
the intended plan for achieving the data mining goals and thereby achieving the reusability, adapt-
ability, and versatility business goals. The plan should specify the steps to be performed during
the rest of the project, including the initial identification of contexts (changes), and the reframing
techniques (I/O, structural) to deal with them.
MoReBikeS example 1.
Finding Business Objectives
Adaptive reuse of learnt knowledge is of critical importance in the majority of knowledge-
intensive application areas, particularly when the context in which the learnt model operates
can be expected to vary from training to deployment. The MoReBikeS challenge (Model Reuse
with Bike Rental Station Data), organised as the ECML-PKDD 2015 Discovery Challenge #1 Kull
et al. (2015a), is focused on model reuse and context change.
The MoReBikeS challenge was carried out in the framework of historical bicycle rental data
obtained from Valencia, Spain. Bicycles are continuously taken from and returned to rental
stations across the city. Due to the patterns in demand, some stations can become empty or full, such that no more bikes can be rented or returned. To reduce the frequency of this
happening, the rental company has to move bikes from full or nearly full stations to empty or
nearly empty stations. Therefore the task is to predict the number of available bikes in every
bike rental station 3 hours in advance. There are at least two use cases for such predictions.
• First, a specific user plans to rent (or return) a bike in 3 hours' time and wants to choose
a bike station which is not empty (or full).
• Second, the company wants to avoid situations where a station is empty or full and
therefore needs to move bikes between stations. For this purpose they need to know
which stations are more likely to be empty or full soon.
• Outputs:
– Inventory of resources: Accurate list of the resources available to the project, including:
personnel, data sources, computing resources and software.
– Requirements, assumptions and constraints: List all requirements of the project (sched-
ule of completion, security and legal restrictions, quality, etc.), list the assumptions made
by the project (economic factors, data quality assumptions, non-checkable assumptions
about the business upon which the project rests, etc.) and list the constraints on the
project (availability of resources, technological and logical constraints, etc.).
– Risks and contingencies: List of the risks or events that might occur to delay the project
or cause it to fail (scheduling, financial, data, results, etc.) and list of the corresponding
contingency plans.
– Terminology: Compile a glossary of technical terms (business and data mining terminol-
ogy) and buzzwords that need clarification.
– Costs and benefits: Construct a cost-benefit analysis for the project (comparing the
estimated costs with the potential benefit to the business if it is successful).
MoReBikeS example 2.
Assessing the Situation
One of the first tasks the consultant faces is to assess the company's resources for data mining.
• Data. Since this is an established company, there is plenty of historical information from
stations as well as information about the current status, time of the day/week/year,
geographical data, weather conditions, etc.
• Outputs:
– Data mining and context-aware goals: Describe the type of data mining problem. Initial
exploration of how the different contexts are going to be used. Describe technical goals.
Describe the desired outputs of the project that enable the achievement of the business
objectives.
– Data mining and context-aware success criteria: Define the criteria for a successful
outcome to the project in technical terms: describe the methods for model and context
assessment, benchmarks, subjective measurements, etc.
MoReBikeS example 3.
Data Mining Goals
The bike rental company needs to move bikes around to avoid empty and full stations. This can
be done more efficiently if the numbers of bikes in the stations are predicted some hours
in advance. The quality of such predictions relies heavily on the recorded usage over long
periods of time. Therefore, the prediction quality on newly opened stations is necessarily
lower. The goals for the study are:
• Use historical information about bike availability in the stations. In this challenge we
explore a setting where there are 200 stations which have been running for more than
2 years and 75 stations which have only been open for a month.
• Reuse the models learned on 200 “old” stations in order to improve prediction perfor-
mance on the 75 “new” stations. Combine information from similar stations to build
improved models. Hence, this challenge evaluates prediction performance on the 75
stations.
• By predicting the number of bikes in the new stations (3 hours in advance), the bike
rental company will be able to move bikes around to avoid empty and full stations.
• Outputs:
– Project plan: List the stages to be executed in the project, together with duration, re-
sources required, inputs, outputs and dependencies. Where possible make explicit the
large-scale iterations in the data mining process, for example repetitions of the modeling
and evaluation phases.
– Initial assessment of tools and techniques: At the end of the first phase, the project also
performs an initial assessment of tools and techniques, including the initial identification
of contexts (changes) and the context-aware techniques to deal with them.
MoReBikeS example 4.
MoReBikeS Example—Assessing Tools and Techniques
After setting the project plan for the study, an initial selection of tools and techniques should
be made taking into account contexts and context changes:
• In this challenge, context is the combination of station and time. It is advisable to use model combination, and retraining on sets of similar stations.
Figure 12: Phase 2. Data Understanding: tasks and activities for context-awareness
The CRISP-DM phase 2 “Data understanding” involves an initial data collection and proceeds
with activities that enable you to become familiar with the data, identify data quality problems,
discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding
hidden information.
To adapt this second phase to address the new needs, we have to enhance the initial data collection task in order to be able to represent different relevant contexts. Through further data exploration we should also be able to contribute to or refine the data description, quality reports and informa-
tion about context representation, and feed into the transformation and other data preparation
steps needed for further analysis.
4.2.1 Collect initial data
• Task: Acquire the data (or access to the data) listed in the project resources. This initial
collection includes data integration if acquired from multiple data sources. Describe attributes
(promising, irrelevant, . . . ), quantity and quality of data. Collect sufficiently rich raw data to represent possibly different relevant contexts.
• Outputs:
– Initial Data Collection Report: Describe data collected: describe attributes (promising,
irrelevant, . . . ), quantity and quality of data and identify relevant contexts.
MoReBikeS example 5.
Initial Data Collection
A procedure to store the number of bikes in all stations every hour has been set up. The
gathered data provides information about 275 bike rental stations in Valencia over a period of
2.5 years (from 01/06/2012 to 31/01/2015). For each hour in this period the data specified
the median number of available bikes during that hour in each of the stations. The dataset was
complemented with weather information about the same hour (temperature, relative humidity,
air pressure, amount of precipitation, wind directions, maximum and mean wind speed).
The bike rental data for Valencia have been obtained from http://biciv.com, weather in-
formation from the Valencian Regional Ministry for the Environment (http://www.citma.gva.
es/) and holiday information from http://jollyday.sourceforge.net/.
MoReBikeS example 6.
Describing Data
There are 24 given features in total, which can be divided into 4 categories:
• Facts of stations. The facts of stations provided in the data set include the station ID,
the latitude, the longitude and the number of docks in that station. All these properties
for one station do not change over time.
• Temporal information. The timestamp of a data entry consists of eight fields: “Times-
tamp” in terms of seconds from the UNIX epoch, “Year”, “Month”, “Day”, “Hour”, “Week-
day”, “Weekhour”, and “IsHoliday” which indicates whether the day is a public holiday.
These features give overlapping temporal information; we only need a subset of them to represent a time point. The “Timestamp” actually includes the information of “Year”, “Month”, “Day”, “Hour”, “Weekday” and “Weekhour”, whereas “Weekday” and “Hour” can also be deduced from “Weekhour” (see the small sketch after this example). Only “IsHoliday” is independent of the others.
• Counts and their statistics. This set of features relates to the target value directly. First
of all, “bikes 3h ago” gives the target value of the 3-hour-earlier time point at a station.
The full profile features use all previous data points of the same “Weekhour” to obtain
long term statistics for each “Weekhour” in each station; correspondingly, the short profile
features only use at most four previous data points to obtain short-term statistics. The
long-term statistics of the 200 old stations show only very small changes over time, in contrast to the short-term ones.
The target variable is “bikes” and it is a non-negative integer representing the median number
of available bikes during the respective hour in the respective rental station.
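As a small aside illustrating this redundancy, the sketch below recovers “Weekday” and “Hour” from “Weekhour”, under the assumption (not stated in the data description) that “Weekhour” counts the hours of the week from 1 (Monday, 00:00) to 168:

# Hypothetical sketch: recover "Weekday" and "Hour" from "Weekhour",
# assuming Weekhour ranges from 1 (Monday 00:00) to 168 (Sunday 23:00).
def split_weekhour(weekhour):
    weekday = (weekhour - 1) // 24 + 1  # 1 = Monday, ..., 7 = Sunday (assumed)
    hour = (weekhour - 1) % 24          # 0..23
    return weekday, hour

assert split_weekhour(1) == (1, 0)      # first hour of the week
assert split_weekhour(168) == (7, 23)   # last hour of the week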
• Outputs:
– Data exploration report: Describe results of this task including (possibly using graphs
and plots) first findings, initial hypotheses, explorations about contexts, particular sub-
sets of relevant data and attributes and their impact on the remainder of the project.
MoReBikeS example 7.
Exploring Data
A lot of work must be done at this stage in the bike scenario: taking pieces of domain knowledge, checking whether they hold, and identifying interesting patterns. We can see that
different stations clearly exhibit different daily patterns. Most obviously, there are stations
that tend to be full in the night and emptier during the day. Essentially these are stations
that are on the outer areas of the city, and the bikes are used during the day to travel into
more central parts of the city. There are also stations that exhibit the opposite pattern. These
stations are left empty at night, since the operators know that they will fill up during the day as
people travel into the city. There are of course stations that fall between these two extremes.
4.2.4 Verify data quality
• Task: Examine the quality of the data: coding or data errors, missing values, bad metadata,
measurement errors and other types of inconsistencies that make analysis difficult.
• Outputs:
– Data quality report: List and describe the results of the data quality verification (is the data correct? does it contain errors? are there missing values? how common are they?) and list possible solutions.
MoReBikeS example 8.
Verifying Data Quality
Some of the issues encountered include missing values in the profile information about the stations, which could be ignored. Timepoint features also have missing values, and only the timepoints with existing values are used.
• Outputs:
– Rationale for inclusion/exclusion: List the data and context to be included/excluded and
the reasons for these decisions.
Figure 13: Phase 3. Data Preparation: tasks and activities for context-awareness
– Selected contexts and changes: Select the contexts and context changes relevant to the data mining goals and ignore the others. Select data to cover the selected contexts and changes.
MoReBikeS example 9.
Selecting Data
Many of the decisions about which data and attributes to select have already been made in
earlier phases of the data mining process. Contexts are modelled as parameters (station and
timestamp) and both need to be modelled later (using all available data).
• Outputs:
– Data cleaning report: Report data-cleaning efforts (missing data, data errors, coding inconsistencies and bad metadata) in order to track alterations to the data and so that future data mining projects can benefit from them.
• Missing data. The missing values are ignored in all profile calculations, i.e. only the
timepoints with existing values are averaged.
• Outputs:
– Derived attributes: Derived attributes are new attributes that are constructed from one
or more existing attributes in the same record. Derive context-specific and context-
independent attributes.
– Generated records: Describe the creation of completely new records. Generate new data to force context-invariance (e.g., rotated images in deep learning).
• There is one feature about the number of bikes in the station 3 hours ago: “bikes 3h
ago”. The profile variables are calculated from earlier available timepoints on the same
station.
• The “full profile bikes” feature is the arithmetic average of the target variable “bikes”
during all past timepoints with the same weekhour, in the same station.
• The “full profile 3h diffbikes” feature is the arithmetic average of the calculated feature
“bikes-bikes 3h ago” during all past timepoints with the same weekhour, in the same
station.
• The “short *” profile features are the same as the full profiles except that they only use the past 4 timepoints with the same weekhour. If there are fewer than 4 such timepoints, then all are used. (A small sketch of these profile computations is given after this list.)
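The following sketch shows one way the full and short profile features just described could be computed; it is illustrative, not the challenge's official code, and the column names ("station", "weekhour", "timestamp", "bikes") are assumptions:

# Illustrative sketch of the profile features described above, assuming a
# DataFrame with columns "station", "weekhour", "timestamp" and "bikes".
import pandas as pd

def add_profiles(df):
    df = df.sort_values("timestamp").copy()
    grp = df.groupby(["station", "weekhour"])["bikes"]
    # full profile: mean of all *previous* timepoints with the same weekhour
    df["full_profile_bikes"] = grp.transform(
        lambda s: s.expanding().mean().shift(1))
    # short profile: mean of at most the 4 previous such timepoints
    df["short_profile_bikes"] = grp.transform(
        lambda s: s.rolling(window=4, min_periods=1).mean().shift(1))
    return df

The shift(1) excludes the current timepoint, so each profile only uses past data, as in the challenge description.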
4.3.4 Integrate data
• Task: These are methods whereby information is combined from multiple sources. There are
two basic methods of integrating data: merging two data sets with similar records but different
attributes or appending two or more data sets with similar attributes but different records.
• Outputs:
– Merged data: This includes merging tables together into a new table, aggregation of data (summarising information) from multiple records and/or tables, and integrating data from relevant contexts.
• Outputs:
– Reformatted data: Syntactic changes made to satisfy the requirements of the specific
modeling tool. Examples: change the order of the attributes and/or records, add an identifier, remove commas from within text fields, trim values, etc.
– Context representation: Select the context representation: how are the contexts going to be represented in the data (parametrisation; context-as-feature vs. context-as-dataset)? (A minimal sketch of the two options is given below.)
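The difference between the two representations can be sketched as follows (hypothetical bike rental data; column names are assumptions):

# Illustrative sketch of the two context representations named above.
import pandas as pd

df = pd.DataFrame({
    "station":  [1, 1, 2, 2],        # context variable
    "weekhour": [10, 11, 10, 11],
    "bikes":    [5, 3, 12, 14],
})

# Context-as-feature: one dataset where the context is an ordinary attribute,
# so a single versatile model can be trained across all contexts.
as_feature = df  # "station" is just another input feature

# Context-as-dataset: one dataset per context value, enabling per-context
# models (or per-context reframing of a shared model).
as_dataset = {station: g.drop(columns="station")
              for station, g in df.groupby("station")}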
4.4 Modelling
In this phase, various modeling techniques are selected and applied, and their parameters are cal-
ibrated to optimal values. Typically, there are several techniques for the same data mining problem
type and this phase is usually conducted in multiple iterations. Some techniques have specific re-
quirements on the form of data, so going back to the data preparation phase is often necessary.
A new optional branch of reframe-based subtasks and deliverables has been added for selecting
the modelling technique. Therefore, we clearly differentiate between classical modelling techniques
and reframing techniques. Furthermore, enhanced procedures for testing the versatile model's quality and validity (context plots and performance metrics) have been added. Specific reframing
Figure 14: Phase 4. Modelling: tasks and activities for context-awareness
tools are needed to build the versatile model. A new general task “REVISE MODEL” has been added for handling model revision in incremental or lifelong learning data mining tasks. Furthermore, a new general task “REFRAME SETTING” has been added in this phase in order to decide which type of reframing should be used (over the versatile model), depending on what aspects of the model are reusable in other contexts. This task will be performed to adapt a versatile model w.r.t. a context whenever the context changes. Finally, context-aware performance metrics are also needed to assess the versatile model.
• Outputs:
– Modeling technique: Document the actual modeling technique that is to be used. In case context matters, select the model and reframing couple, e.g., scorer and score-driven reframing, or linear regression and continuous output reframing.
– Modeling assumptions: Many modeling techniques make specific assumptions on the
data, e.g., all attributes have uniform distributions, no missing values allowed, class at-
tribute must be symbolic, etc. Record any such assumptions made.
month test period across 50 test stations, with different forecasting windows, grouped by
length of history and perhaps some meta-information about the station.
• Outputs:
– Parameter settings: Most modeling techniques have a large number of parameters that
can be adjusted. List the parameters and their chosen value, along with the rationale for
the choice of parameter settings.
– Models: These are the actual models produced by the modeling tool, not a report.
– Model description: Describe the resultant model. Report on the results of the model and any meaningful conclusions; document any difficulties or inconsistencies encountered, along with their meanings.
• Reframe version: a possible solution to the problem consists of combining the predictions of the K nearest stations among the old stations (1:200) to each target station (201:275) using a weighted arithmetic mean. On the one hand, these predictions are calculated by applying the best model, in terms of MAE, of each old station (1:200). On the other hand, the K nearest neighbours are obtained by comparing each target station (201:275) to all the old stations (1:200) in terms of the Euclidean distance between them, and selecting the K closest old stations as its K nearest neighbours. The Euclidean distance between the target station and each of its neighbours is then used to weight the influence of that neighbour's prediction on the final prediction, normalising by the sum of the K weights. In doing so, the final prediction is obtained from K predictions, each taken into account with a different importance according to its proximity to the target station. (A minimal sketch of this scheme is given after this list.)
• Retraining version: it consists of using the data of the roughly 2.5-year period between 2012 and 2014 for 10 docking stations in the city of Valencia, as well as the one-month partial training data provided for 190 other stations throughout the city.
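The reframe version above can be sketched as follows. Since the text is ambiguous about the exact normalisation, the sketch interprets the weighting as inverse-distance weighting (closer stations get more influence), which matches the stated intent that importance follows proximity; all names and the toy data are assumptions:

# Minimal sketch of the reframe version described above. Inverse-distance
# weighting is an interpretation of the text, not a confirmed detail.
import numpy as np

def reframe_prediction(target_xy, old_xy, old_preds, k=3, eps=1e-9):
    """Combine the predictions of the k nearest old stations."""
    dists = np.linalg.norm(old_xy - target_xy, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # k nearest old stations
    weights = 1.0 / (dists[nearest] + eps)              # proximity = influence
    return np.average(old_preds[nearest], weights=weights)

# Toy example: 5 old stations, their coordinates and their models' predictions.
old_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]])
old_preds = np.array([10.0, 12.0, 8.0, 30.0, 15.0])
print(reframe_prediction(np.array([0.5, 0.5]), old_xy, old_preds, k=3))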
4.4.4 Revise Model
• Task: Once we have built a model, and as a result of incremental or lifelong learning, the model may need to be revised (patched or extended) because some novelty or inconsistency of the new data is detected with respect to the existing model. This can be extended to context changes, provided we can determine when the context has changed significantly enough to deserve a revision process.
• Outputs:
– Model assessment: Summarize the results of this task by using evaluation charts, analysis nodes, cross-validation charts, etc.; list the qualities of the generated models (e.g., in terms of accuracy) and rank their quality in relation to each other. In context-aware tasks, compare with different scenarios, in particular retraining.
– Revised parameter settings: According to the model assessment, revise parameter set-
tings and tune them for the next run in the Build Model task. Iterate model building and
assessment until you strongly believe that you found the best model(s). Document all
such revisions and assessments.
• Outputs:
MoReBikeS example 18.
Reframe setting
In choosing the models to be reused, the criteria for model suitability for a given test station have to be decided. These included: the performance of the model on the test station during the deployment period; the distance between the test station and the station of the model's origin; the similarity between the time series of the stations during the deployment period; and several combinations of these.
4.5 Evaluation
Once you have built a model (or models) that, according to the evaluation task in the previous
phase, appears to have high quality from a data analysis perspective, it is important to thoroughly
evaluate it (perhaps going through the previous phases) to be certain the model properly achieves
the business objectives. Therefore, this step requires a clear understanding of the stated business
goals. A key objective is to determine how well the results answer your organization's business goals
and whether there is some important business issue that has not been sufficiently considered. At
the end of this phase, a decision on the use of the data mining results should be reached.
Regarding context-awareness, in this phase we need an enhanced task for assessing whether
the versatile model meets the business objectives in all the relevant contexts where it is to be deployed. Furthermore, we need to decide whether or not the versatile model can be reused and adapted to the deployment data. (A minimal sketch of such a per-context assessment is given below.)
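As an illustration of such a per-context assessment (the document prescribes no specific procedure, so names, data and the threshold below are assumptions), one can compute the chosen metric per context and check the success criterion in every relevant context rather than only on aggregate:

# Assumed sketch: evaluate a model per context (e.g., per station) and check
# the business success criterion in every relevant context.
import numpy as np

def mae_per_context(y_true, y_pred, contexts):
    """Return {context: MAE} with predictions grouped by context value."""
    return {c: np.mean(np.abs(y_true[contexts == c] - y_pred[contexts == c]))
            for c in np.unique(contexts)}

y_true = np.array([5, 7, 12, 14, 3, 4])     # illustrative data
y_pred = np.array([6, 7, 10, 15, 3, 6])
stations = np.array([201, 201, 202, 202, 203, 203])

threshold = 2.0  # illustrative success criterion (MAE per station)
for station, mae in mae_per_context(y_true, y_pred, stations).items():
    status = "meets criterion" if mae <= threshold else "fails criterion"
    print(f"station {station}: MAE={mae:.2f} ({status})")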
4.5.1 Evaluate results
• Task: Unlike the previous evaluation steps which dealt with factors such as the accuracy and
generality of the model, in this step we need to assess the degree to which the model meets the
business objectives and, thus, this step requires a clear understanding of the stated business
goals. We need to determine if there is some business reason why this model is deficient, if
results are stated clearly, if there are novel or unique findings that should be highlighted, if
results raised additional questions, etc.
• Outputs:
– Assessment of data mining results with respect to business success criteria: Summarize the assessment results in terms of the business success criteria, interpret the data mining results, check the impact of the results on the initial application goal of the project, see whether the discovered information is novel and useful, rank the results, state conclusions, check whether the results cover all contexts relevant for the business success criteria, etc.
– Approved models: Select those (versatile) models which, after the previous assessment with respect to business success criteria, meet the selected criteria.
• New Questions. The most important questions to come out of the study are: How often do the stations remain empty or full because of bad predictions? How much time is wasted in carrying bikes around because of bad predictions? Can we use different evaluation measures in modelling to achieve better results? How often do we need to retrain the models?
• Outputs:
– Review of process: Summarize the process review and all the activities and decisions for
each phase. Give hints for activities that have been missed and/or should be repeated.
MoReBikeS example 20.
MoReBikeS Example—Review Report
As a result of reviewing the process of the initial data mining project, the bike rental company has developed a greater appreciation of the interrelations between steps and of the process's inherent “backtracking” nature. Furthermore, the company has learned that model reuse between similar stations is appropriate when historical data is not provided or does not exist.
• Outputs:
– List of possible actions: List possible further actions along with the reasons for and against each option: analyse the potential for deployment and improvement (for each result obtained), recommend alternative following phases, refine the whole process, etc.
– Decision: Describe the decision made: rank alternatives, document reasons for the choice
and how to proceed along with the rationale.
4.6 Deployment
Creation of the model is generally not the end of the project, and deployment is the process of using the discovered insights to make improvements (or changes) within your organization. Even if the results may not be formally integrated into your information systems, the knowledge gained will undoubtedly be useful for planning and making marketing decisions. This phase often involves planning and monitoring the deployment of results or completing wrap-up tasks such as producing a final report and conducting a project review.
Regarding context-aware data mining tasks, in this phase we need to determine in what way the versatile model (or the pool of models) is to be kept, used, evaluated and maintained for long-term use. Furthermore, we may need to monitor the possible change of the context distribution or check whether its range is the same as expected (a minimal sketch of such a check is given below). If not, we may need to re-evaluate some models for a new distribution of contexts, thus going back to previous phases/tasks.
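One simple way to monitor the context distribution, sketched below under assumed data and a conventional significance level, is a two-sample test comparing the context values expected at design time with those observed at deployment:

# Assumed sketch: monitor the context distribution at deployment with a
# two-sample Kolmogorov-Smirnov test against the design-time distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
expected_contexts = rng.normal(loc=0.0, scale=1.0, size=1000)  # design time
observed_contexts = rng.normal(loc=0.5, scale=1.0, size=200)   # deployment

stat, p_value = ks_2samp(expected_contexts, observed_contexts)
if p_value < 0.05:  # illustrative significance level
    print("Context distribution seems to have changed:"
          " re-evaluate or reframe the models (go back to earlier phases).")
else:
    print("Context distribution within the expected range.")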
Figure 16: Phase 6. Deployment: tasks and activities for reframing.
• Outputs:
– Deployment plan: This task takes the evaluation results and determines a strategy for deployment. If a general procedure has been identified to create the relevant model(s) and integrate them within your database systems, this procedure is documented here (step-by-step plan and integration) for later deployment, including technical details, the benefits of monitoring, possible deployment problems, etc. Furthermore, create a plan to disseminate the relevant information to strategy makers.
– Model selection w.r.t. the context: Determine the pace at which context values are captured or estimated. Determine how the pool of models is going to be kept and selected according to context.
MoReBikeS example 22.
Deployment Planning
Can we use different evaluation measures in modelling to achieve better results? How often do we need to retrain the models?
• Outputs:
– Final report: The final report is where all the threads are brought together. It should include a thorough description of the original business problem, the process used to conduct the data mining, how well the initial data mining goals have been met, which (versatile) models are reused again and again, budget and costs (cost of reframing? of retraining? how significant has context been?), deviations from the original plan, a summary of the data mining results, an overview of the deployment process, recommendations and insights discovered, etc.
– Final presentation: In addition to the final report, there will often be a meeting at the end of the project at which the results are verbally presented to the customer and other stakeholders.
4.6.4 Produce final report
• Task: This is the final step of the CASP-DM methodology. In it we assess what went right and what went wrong (and needs to be improved), the final impressions, lessons learned, etc.
• Outputs:
5 Discussion
Data mining is a discipline with strong technical roots in statistics, machine learning and information
systems. The advance in techniques, tools and platforms, jointly with the increase of the availability
of data and the higher complexity of projects and teams, has been so significant in the past decade
that methodological issues are becoming more important to harness all this potential in an effi-
cient way. The perspective of data science, where data mining goals are more data-oriented than business-oriented (as in a more classical, direct data mining process), may suggest that rigid methodologies cannot cope with the variability of problems, which have to be adjusted to related scenarios very frequently, in terms of changes of data, goals, resolution, noise or utility functions.
In contrast, we have advocated here that successful methodologies, such as CRISP-DM, can play this role if they become less rigid and accommodate the variability of the application in a more systematic way. The notion of context, and its identification and parametrisation, is a gen-
eral way to anticipate all these changes and consider them from the very beginning. This is why
CASP-DM tries to extend CRISP-DM to make this possible. The explicit existence of activities and tasks specifically designed for context identification and handling ensures that companies and practitioners will not overlook this important aspect and will plan data mining projects in a more robust way, where data transformations and model constructions can be reused and not jettisoned whenever any aspect of the context changes. We have illustrated how CASP-DM goes through these context issues with some real examples.
CASP-DM not only considers context-awareness in the whole process, but is backward com-
patible with CRISP-DM, the most common methodology in data mining. This means that CRISP-
DM users can adopt CASP-DM immediately and even complement their existing projects with the
context-aware bits, making them more versatile. In order to make this transition from CRISP-DM to CASP-DM, it is also important to have a stable platform and community where CASP-DM documents,
phases and planning tools can be integrated and located for data mining practitioners. For instance,
it is hard to find the CRISP-DM documentation, as nobody is maintaining it any more. To take that
reference role, we have set up a community around www.casp-dm.org, where data mining practi-
tioners can find information about CRISP-DM and CASP-DM, but also about context-awareness and
other related areas such as reframing and domain adaptation. It is also our intention to associate
a working group with this initiative, so that CASP-DM can also evolve with the new methodological
challenges of data mining.
References
Abowd, G. D., Dey, A. K., Brown, P. J., Davies, N., Smith, M., and Steggles, P. (1999). Towards a better
understanding of context and context-awareness. In Handheld and ubiquitous computing, pages
304–307. Springer.
Anand, S. S. and Büchner, A. G. (1998). Decision support using data mining. Financial Times Man-
agement.
Anand, S. S., Patrick, A., Hughes, J. G., and Bell, D. A. (1998). A data mining methodology for
cross-sales. Knowledge-Based Systems, 10(7):449–461.
Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4):343–370.
Bi, J. and Bennett, K. P. (2003). Regression error characteristic curves. In Twentieth International
Conference on Machine Learning (ICML-2003). Washington, DC.
Blanco-Vega, R., Ferri, C., Hernández-Orallo, J., and Ramírez-Quintana, M. J. (2006). Estimating the
class probability threshold without training data. ROC Analysis in Machine Learning, page 9.
Brachman, R. J. and Anand, T. (1996). Advances in knowledge discovery and data mining. chap-
ter The Process of Knowledge Discovery in Databases, pages 37–57. American Association for
Artificial Intelligence, Menlo Park, CA, USA.
Brunk, C., Kelly, J., and Kohavi, R. (1997). Mineset: An integrated system for data mining. In KDD,
pages 135–138.
Buchner, A. G., Mulvenna, M. D., Anand, S. S., and Hughes, J. G. (1999). An internet-enabled knowl-
edge discovery process. In Proceedings of the 9th international database conference, Hong Kong,
volume 1999, pages 13–27.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A. (1998). Discovering data mining:
from concept to implementation. Prentice-Hall, Inc.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. (2000).
Crisp-dm 1.0 step-by-step data mining guide.
Chow, C. (1970). On optimum recognition error and reject tradeoff. Information Theory, IEEE
Transactions on, 16(1):41–46.
Cios, K. J. and Kurgan, L. A. (2005). Trends in data mining and knowledge discovery. In Advanced
techniques in knowledge discovery and data mining, pages 1–26. Springer.
Cios, K. J., Teresinska, A., Konieczna, S., Potocka, J., and Sharma, S. (2000). A knowledge discovery
approach to diagnosing myocardial perfusion. Engineering in Medicine and Biology Magazine,
IEEE, 19(4):17–25.
Debuse, J., de la Iglesia, B., Howard, C., and Rayward-Smith, V. (2001). Building the kdd roadmap.
In Industrial Knowledge Management, pages 179–196. Springer.
Drummond, C. and Holte, R. (2006). Cost Curves: An Improved Method for Visualizing Classifier
Performance. Machine Learning, 65:95–130.
Edelstein, H. A. (1998). Introduction to data mining and knowledge discovery. Two Crows.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996a). The kdd process for extracting useful
knowledge from volumes of data. Commun. ACM, 39(11):27–34.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (1996b). Advances in knowledge
discovery and data mining.
Ferri, C., Hernández-Orallo, J., and Modroiu, R. (2009). An experimental comparison of performance
measures for classification. Pattern Recognition Letters, 30(1):27–38.
Flach, P. (2010). ROC analysis. In Encyclopedia of Machine Learning, pages 869–875. Springer.
Flach, P., Blockeel, H., Ferri, C., Hernández-Orallo, J., and Struyf, J. (2003). Decision support for
data mining. In Data Mining and Decision Support, pages 81–90. Springer.
Flach, P., Hernández-Orallo, J., and Ferri, C. (2011). A coherent interpretation of AUC as a measure
of aggregated classification performance. In ICML.
Frénay, B. and Verleysen, M. (2013). Classification in the presence of label noise: a survey. IEEE
Transactions on Neural Networks and Learning Systems, 25(5).
Gertosio, C. and Dussauchoy, A. (2004). Knowledge discovery from industrial databases. Journal
of Intelligent Manufacturing, 15(1):29–37.
Giraud-Carrier, C., Vilalta, R., and Brazdil, P. (2004). Introduction to the special issue on meta-
learning. Machine learning, 54(3):187–193.
Hand, D. (2009). Measuring classifier performance: a coherent alternative to the area under the
ROC curve. Machine learning, 77(1):103–123.
Harry, M. J. (1998). Six sigma: a breakthrough strategy for profitability. Quality progress, 31(5):60.
Hernández-Orallo, J., Ferri, C., Lachiche, N., Martínez-Usó, A., and Ramírez-Quintana, M. J. (2015).
Binarised regression tasks: methods and evaluation metrics. Data Mining and Knowledge Discov-
ery, pages 1–43.
Hernández-Orallo, J., Ferri, C., Lachiche, N., Martínez-Usó, A., and Ramírez-Quintana, M. J. (2016).
Binarised regression tasks: methods and evaluation metrics. Data Mining and Knowledge Discov-
ery, 30(4):848–890.
Hernández-Orallo, J., Flach, P., and Ferri, C. (2011). Brier curves: a new cost-based visualisation of
classifier performance. In ICML.
Hernández-Orallo, J., Flach, P., and Ferri, C. (2012a). A unified view of performance metrics: Trans-
lating threshold choice into expected classification loss. JMLR, 13:2813–2869.
Hernández-Orallo, J., Flach, P., and Ferri, C. (2012b). A unified view of performance metrics: Trans-
lating threshold choice into expected classification loss. Journal of Machine Learning Research,
13:2813–2869.
Hernández-Orallo, J., Flach, P., and Ferri, C. (2013). ROC curves in cost space. Machine Learning,
93(1):71–91.
Hernández-Orallo, J., Usó, A. M., Prudêncio, R. B. C., Kull, M., Flach, P. A., Ahmed, C. F., and Lachiche,
N. (2016). Reframing in context: A systematic approach for model reuse in machine learning. AI
Commun., 29(5):551–566.
Khreich, W., Granger, E., Miri, A., and Sabourin, R. (2012). A survey of techniques for incremental
learning of HMM parameters. Information Sciences, 197:105–130.
Kull, M. and Flach, P. (2014). Patterns of dataset shift. In Ws. on Learning over Multiple Contexts at
ECML2014 (LMCE).
Kull, M. and Hernández-Orallo, J. (2015). Missing values on purpose: Model selection and reframing
with attribute and prediction costs. submitted.
Kull, M., Lachiche, N., and Martínez-Usó, A. (2015a). MoReBikeS: model reuse with bike rental station data.
Kull, M., Lachiche, N., and Usó, A. M. (2015b). Model reuse with bike rental station data (pream-
ble). In Proceedings of the ECML/PKDD 2015 Discovery Challenges co-located with European
Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
(ECML-PKDD 2015), Porto, Portugal, September 7-11, 2015.
Lo, H.-Y., Wang, J.-C., Wang, H.-M., and Lin, S.-D. (2011). Cost-sensitive multi-label learning for
audio tag annotation and retrieval. Multimedia, IEEE Transactions on, 13(3):518–529.
Mariscal, G., Marban, O., and Fernandez, C. (2010). A survey of data mining and knowledge discov-
ery process models and methodologies. The Knowledge Engineering Review, 25(02):137–166.
Martínez-Usó, A., Hernández-Orallo, J., Ramírez-Quintana, M. J., and Plumed, F. M. (2015). Pentaho
+ R: An Integral View for Multidimensional Prediction Models, pages 234–244. Springer Interna-
tional Publishing.
Metz, C. E. (1978). Basic principles of ROC analysis. In Seminars in nuclear medicine, volume 8,4,
pages 283–298. Elsevier.
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012). A unifying
view on dataset shift in classification. Pattern Recognition, 45(1):521–530.
Moyle, S. and Jorge, A. (2001). Ramsys-a methodology for supporting rapid remote collaborative
data mining projects. In ECML/PKDD 2001 Workshop on Integrating Aspects of Data Mining,
Decision Support and Meta-Learning: Internal SolEuNet Session, pages 20–31.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. Knowledge and Data Engineering,
IEEE Transactions on, 22(10):1345–1359.
Pietraszek, T. (2007). On the use of ROC analysis for the optimization of abstaining classifiers.
Machine Learning, 68(2):137–169.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift
in machine learning. The MIT Press.
Raedt, L. D. (1992). Interactive Theory Revision: An Inductive Logic Programming Approach. Aca-
demic Press.
Richards, B. L. and Mooney, R. J. (1991). First-order theory revision. In ML, pages 447–451.
Scheirer, W. J., de Rezende-Rocha, A., Sapkota, A., and Boult, T. E. (2013). Toward open set recog-
nition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(7):1757–1772.
Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? Advances in neural
information processing systems, pages 640–646.
Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
Torrey, L. and Shavlik, J. (2009). Transfer learning. Handbook of Research on Machine Learning
Applications, 3:17–35.
Tortorella, F. (2005). A ROC-based reject rule for dichotomizers. Pattern Recognition Letters,
26(2):167–180.
Turney, P. (2000). Types of cost in inductive concept learning. Canada National Research Council
Publications Archive.
Vanderlooy, S., Sprinkhuizen-Kuyper, I., and Smirnov, E. (2006). An analysis of reliable classifiers
through ROC isometrics. In Proceedings of the ICML 2006 Ws. on ROC Analysis (ROCML 2006),
Pittsburgh, USA, June, volume 29, pages 55–62.
Xu, Z., Kusner, M. J., Weinberger, K. Q., Chen, M., and Chapelle, O. (2014). Classifier cascades and
trees for minimizing feature evaluation cost. JMLR, 15:2113–2144.