Rapidminer 6 Manual English
RapidMiner Studio 6
User Manual
26th May 2014
RapidMiner
www.rapidminer.com
© 2014 by RapidMiner GmbH. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by means electronic, mechanical, photocopying, or
otherwise, without prior written permission of RapidMiner GmbH.
Contents
1 Fundamental Terms
1.1 Coincidence or not?
1.2 Fundamental Terms
1.2.1 Attributes and Target Attributes
1.2.2 Concepts and Examples
1.2.3 Attribute Roles
1.2.4 Value Types
1.2.5 Data and Meta Data
1.2.6 Modelling
2 First steps
2.1 Installation and First Repository
2.2 Perspectives and Views
2.3 Design Perspective
2.3.1 Operators and Repositories View
2.3.2 Process View
2.3.3 Operators and Processes
2.3.4 Further Options of the Process View
2.3.5 Parameters View
2.3.6 Help and Comment View
2.3.7 Overview View
2.3.8 Problems and Log View
3 Design of Analysis Processes
3.1 Creating a New Process
3.2 Repository Actions
3.3 The First Analysis Process
3.3.1 Transforming Meta Data
3.4 Executing Processes
3.4.1 Looking at Results
3.4.2 Breakpoints
4 Data and Result Visualization
4.1 Result Visualization
4.1.1 Sources for Displaying Results
4.2 About Data Copies and Views
4.3 Display Formats
4.3.1 Description
4.3.2 Tables
4.3.3 Charts
4.3.4 Graphs
4.3.5 Special Views
4.4 Result Overview
5 Repository
5.1 The RapidMiner Studio Repository
5.1.1 Creating a New Repository
5.2 Using the Repository
5.2.1 Processes and Relative Repository Descriptions
5.2.2 Importing Data and Objects into the Repository
5.2.3 Access to and Administration of the Repository
5.2.4 The Process Context
5.3 Data and Meta Data
5.3.1 Propagating Meta Data from the Repository and through the Process
1 Motivation and Fundamental Terms
In this chapter we would like to give you a small incentive for using data mining
and at the same time introduce you to the most important terms.
Whether you are already an experienced data mining expert or not, this chapter
is worth reading so that you know and have a command of the terms used
both here and in RapidMiner.
1.1 Coincidence or not?
Before we get properly started, let us try a small experiment:
Think of a number between 1 and 10.
Multiply this number by 9.
Work out the digit sum of the result, i.e. the sum of its digits.
Multiply the result by 4.
Divide the result by 3.
Deduct 10.
The result is 2.
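If you would like to check that the outcome really does not depend on the number you chose, the steps above can be replayed for every possible starting number. The following is only a small sketch in Python; the function names `digit_sum` and `experiment` are our own choice for this illustration and are not part of RapidMiner.

```python
def digit_sum(n: int) -> int:
    """Return the sum of the decimal digits of a non-negative integer."""
    return sum(int(digit) for digit in str(n))


def experiment(start: int) -> int:
    """Apply the steps of the experiment to the chosen starting number."""
    result = start * 9           # multiply the number by 9
    result = digit_sum(result)   # digit sum of the result
    result = result * 4          # multiply by 4
    result = result // 3         # divide by 3 (exact for every start from 1 to 10)
    return result - 10           # deduct 10


# Replay the experiment for every number between 1 and 10.
results = {start: experiment(start) for start in range(1, 11)}
print(results)
```

Every multiple of 9 from 9 to 90 has the digit sum 9, so from the second step onwards the calculation is identical for all starting numbers and always ends at 2.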
Do you believe in coincidence? As an analyst you will probably learn to answer
this question in the negative or even do so already. Let us take for example what
is probably the simplest random event you could imagine, i.e. the toss of a coin.
"Ah", you may think, "but that is a random event and nobody can predict which
side of the coin will be showing after it is tossed". That may be correct, but
the fact that nobody can predict it in no way means that it is impossible in
principle. If all influence factors such as the throwing speed and rotation angle,
material properties of the coin and those of the ground, mass distributions and
even the strength and direction of the wind were known exactly, then we would
be quite able, with some time and effort, to predict the result of such a coin toss.
The physical formulas for this are all known in any case.
We shall now look at another scenario, only this time we can predict the outcome
of the situation: A glass will break if it falls from a certain height onto a certain
type of ground. Even in the fractions of a second while the glass is falling, we
already know: there will be broken glass. How are we able to achieve this rather
amazing feat? We have never before seen the glass that is falling at this instant
break, and the physical formulas that describe the breakage of glass are a complete
mystery, for most of us at least. Of course, the glass may stay intact by chance
in individual cases, but this is not likely. For what it's worth, the glass not
breaking would be just as non-coincidental, since this result also follows physical
laws. For example, the energy of the impact is transferred to the ground better
in this case. So how do we humans know exactly what will happen next in some
cases but not in others, for example that of the toss of a coin?
The most frequent explanation used by laymen in this case is the description of
the one scenario as coincidental and the other as non-coincidental. We shall
not go into the interesting yet rather philosophical discussions on this
topic, but we put forward the following thesis:
The vast majority of processes in our perceptible environment are not a result
of coincidences. Our inability to describe and extrapolate these processes
precisely is rather due to the fact that we are not able to recognise, measure
or correlate the necessary influence factors.
In the case of the falling glass, we quickly recognised the most important
characteristics such as the material, falling height and nature of the ground and can
already estimate, in the shortest time, the probability of the glass breaking by
reasoning by analogy from similar experiences. However, that is precisely what we
cannot do with the toss of a coin. We can watch as many tosses of a coin as we like; we
will never manage to recognise the necessary factors fast enough and extrapolate
them accordingly in the case of a random throw.
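Estimating a probability by analogy with similar experiences can be made concrete in a few lines of code. The following is only a rough sketch in Python under several assumptions of our own: the past events in `past_falls` are invented purely for illustration, each event is described by just two characteristics (falling height and hardness of the ground), similarity is measured by the Euclidean distance, and the three most similar events are consulted. None of these choices is prescribed by RapidMiner.

```python
import math

# Hypothetical past events, invented purely for illustration:
# (falling height in metres, ground hardness from 0 = soft to 1 = hard, broke?)
past_falls = [
    (1.5, 0.9, True),
    (1.2, 0.8, True),
    (0.1, 0.9, False),
    (1.0, 0.1, False),
    (2.0, 1.0, True),
    (0.2, 0.2, False),
]


def distance(a, b):
    """Euclidean distance between two events; smaller means more similar."""
    return math.dist(a, b)


def estimate_break_probability(event, history, k=3):
    """Share of the k most similar past events in which the article broke."""
    # Sort past events by their distance to the new event and keep the k closest.
    nearest = sorted(history, key=lambda past: distance(event, past[:2]))[:k]
    return sum(broke for _, _, broke in nearest) / k


# A glass falling from 1.4 m onto a hard floor (hardness 0.85).
print(estimate_break_probability((1.4, 0.85), past_falls))
```

For the example event, the three most similar past events are all falls from a considerable height onto hard ground, and the article broke in each of them, so the estimate is 1.0. With real data the characteristics would normally have to be brought to comparable scales first, since otherwise the one with the largest range of values dominates the distance.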
So what were we doing in our heads when we made the prediction for the state
of the glass after the impact? We measured the characteristics of this event. You
could also say that we collected data describing the fall of the glass. We then
reasoned very quickly by analogy, i.e. we made a comparison with earlier falling
glasses, cups, porcelain figurines or similar articles based on a similarity measure.
Two things are necessary for this: firstly, we need to have the data of earlier
events available and secondly, we need to be aware of how a similarity between
the current and past data is defined at all. Ultimately we are able to make an
estimation or prediction by looking, for example, at the most similar events that
have already taken place. Did the falling article break in these cases or
not? We must first find the events with the greatest similarity, which represents
a kind of optimisation. We use the term