13-How Good Is Your Data

Stocks & Commodities V. 29:2 (42-47): How Good Is Your Data?
by Sunny Harris
TRADING TECHNIQUES
How Good Is Your Data?

All data is equal. At least, that's what we think.

by Sunny J. Harris
Illustration: Bruce Waldman
Copyright Technical Analysis Inc.

Is all data equal? If truth be told, I never gave it much thought. I have been using one vendor nearly exclusively for about 20 years. My fills are good enough. My closing prices seem to match what I see on television or find online. As long as the profits roll in, there has been no reason to question the data.
But then I was told by another vendor that my vendor's data is off by just enough to generate a side income, through the slippage from actual price to the price I am presented. My curiosity was piqued, and so I decided to investigate. First, I set up a spreadsheet and compared the two vendors. To keep it simple, I considered only the past five years of data. My data experiment ran from June 30, 2005, to June 29, 2010.

Running experiments
I began by exporting the data for a single symbol from each software application to a comma-separated value (CSV) text file. The instrument I chose was the Russell 2000 index, which has different symbols in different software, like RUT, $RUT, and RU2000. I selected the Russell 2000 because of its high liquidity and ease of use, and because it is something little guys like us can trade.
Figure 1 shows the beginning of the spreadsheet, with the data of the two vendors (T and M) in the columns. At first glance it appeared that everything was in order, with small discrepancies here and there. The differences in the data, where there are any, seem to be out in the hundredths place, like 600.01 vs. 600.02. That wouldn't make much difference over time, with some errors to the positive and some to the negative. It seems like it should be a wash.

FIGURE 1: DATA COMPARISON, VENDOR T VS. VENDOR M. The differences in the data seem to be in the hundredths. Will deviations this small affect your profits?

FIGURE 2: TOTAL DIFFERENCES, T VS. M. In the first row of data you see a sum of the differences in the open, high, low, and close between the two vendors. Now it gets interesting. The closes are 52 points lower, the opens are 40 points higher, the highs are 65 points lower, and the lows are 48 points higher.

Next, I put columns in the spreadsheet to calculate the differences between the open, high, low, and close (OHLC) of each vendor. Part of that spreadsheet is shown in Figure 2. At the top of each column, in the first row of data, is the result of calculating the sum of all the differences between the two vendors' OHLC data. I wouldn't have been surprised if each component had been consistently lower or higher than the other. But these summation numbers show that the data is all over the map. The closes are 52 points lower, the opens are 40 points higher, the highs are 65 points lower, and the lows are 48 points higher. The spread between the numbers is alternating positive and negative. Could it be, as one vendor suggested, that there is enough of a spread in there for vendor T to cash in on the spread alone?
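The column totals described above can also be sketched outside a spreadsheet. The following is a minimal illustration in Python, not the author's actual workbook: the function names, the CSV column names (date, open, high, low, close), and the sample rows are all assumptions made up for the example. It sums the signed difference (first source minus second) for each OHLC field over the dates both sources share.

```python
import csv
import io
from decimal import Decimal

FIELDS = ("open", "high", "low", "close")

def load_bars(csv_text):
    """Parse CSV text with date,open,high,low,close columns into {date: {field: Decimal}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["date"]: {f: Decimal(row[f]) for f in FIELDS} for row in reader}

def total_differences(bars_a, bars_b):
    """Sum the signed differences (a minus b) per field over the dates both sources share."""
    totals = {f: Decimal("0") for f in FIELDS}
    for date in sorted(bars_a.keys() & bars_b.keys()):
        for f in FIELDS:
            totals[f] += bars_a[date][f] - bars_b[date][f]
    return totals

# Hypothetical sample rows for illustration only; NOT the article's data.
VENDOR_T = """date,open,high,low,close
2005-06-30,600.02,605.10,598.40,600.01
2005-07-01,600.50,606.00,599.00,604.25
"""
VENDOR_M = """date,open,high,low,close
2005-06-30,600.01,605.11,598.39,600.02
2005-07-01,600.50,606.02,599.00,604.26
"""

totals = total_differences(load_bars(VENDOR_T), load_bars(VENDOR_M))
for field in FIELDS:
    # A positive total means the first source is higher overall for that field.
    print(f"{field:>5}: {totals[field]:+}")
```

Prices are parsed as Decimal rather than float so that hundredths-place differences are summed exactly, which matters when the discrepancies being measured are themselves only 0.01 wide.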
With this information, I wanted to compare the data I had come up with to another well-known vendor to see whether their data matched either vendor T or vendor M. So I went back to the Export facility in software G to create another set of columns in the spreadsheet. I hoped that the data from vendor G would match one or the other of the first two vendors and I would come up with an answer.
Figure 3 shows the data from vendor T and vendor G, the new set for comparison. Again, I gave the data a cursory glance, but nothing seemed amiss. The variations are again out in the hundredths place.

FIGURE 3: DATA COMPARISON, T VS. G. The variations are, for the most part, in the hundredths. Nothing seems amiss.

I found small discrepancies that led to large numbers when summed over time (Figure 4). On its own, an error of 0.01 doesn't seem like much. But when you add that up over five years of data, it is 1,257 trading days and an accumulated error of $12.57. Remember, each point is worth $100 on the RUT.
This is where it starts looking scary. Multiplying $12.57 x 100 gives you $1,257. That's over $1,000 out of the trader's pocket. It isn't huge, but if you are the vendor and you have 20,000 clients at $1,000 each, that comes to $20 million. That is $20 million over five years. Now I was beginning to understand what that vendor was talking about.

FIGURE 4: TOTAL DIFFERENCES, T VS. G. Although the discrepancies are small, when summed over five years of data it could accumulate to $12.57 per point on the Russell 2000.

Still, I couldn't go anywhere with this bit of information. This situation was akin to having a clock shop where each clock tells time a bit off from every other clock in the shop. There's no way to tell what time it really is. Which clock is telling the right time?
This situation demands that I compare the data from vendor G to vendor T and also to vendor M. I'm not sure what I would find out if none of it matched, but if one matched one other, then I'll know something about the veracity of the vendor's data that didn't match.
Here's the spreadsheet I have for three vendors' data so far (see Figure 5). Back to the differences spreadsheet, I inserted columns for calculating the new spreads: vendors T versus G; T versus M; and G versus M. Each pairing can then be compared against the others, and maybe I'll get some clarity. The differences section of the spreadsheet can be seen in Figure 6.

FIGURE 5: THREE VENDORS' DATA. Comparing data from three vendors will say something about the veracity of the data.

FIGURE 6: TOTAL DIFFERENCES, T VS. M VS. G. Even though there appears to be hardly any discrepancies between M and G, the totals reflect another story.

Aha! Look at the zeroes in columns BD through BG. Reading the description in row B over those columns (shaded green), I see that the zeroes show up when comparing vendors M to G. Still, looking at the numbers over the header "Differences M v G," we see that despite all the zeroes there are discrepancies along the way, giving us (9.17) among the closing values.
As I scanned the columns of this spreadsheet comparison, I found that on September 17, 2008, there was a difference of (8.83) between the close of M and the close of G. That was where most of the error comes in.
How could these vendors have such differences among their data? Isn't the close the close, no matter who vends it?
Next, I called the Russell 2000 exchange and got the data from them. Now that data should be correct, right? It's their own index, so they should know. But they only had closing prices for the first part of my experiment (September 30, 2005, through June 22, 2007). Closes will have to do for comparing data vendors versus the Russell 2000 itself. Their closing prices are accurate down to six decimal places, while the others only have two digits after the decimal.
Adding yet another set of columns to the spreadsheet, I placed the RU2000 from the exchange in place. While I was at it, I introduced another column with the calculation for the range of each day. As I looked at the data, I wondered whether the errors worked themselves out by having the same range for the day, even if the open and close were different. That's why there's a column labeled "Range." Figure 7 shows three data vendors and the Russell 2000 exchange data side by side.
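The three calculations described above (pairwise spreads between vendors, the daily range column, and the dollar value of a small accumulated error) can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the spreadsheet: the function names and the sample closes are hypothetical, and only the figures passed to the last function (0.01 per day, 1,257 trading days, $100 per point) come from the article.

```python
from decimal import Decimal
from itertools import combinations

# Hypothetical closing prices keyed by vendor and date; NOT the article's data.
CLOSES = {
    "T": {"2005-06-30": Decimal("600.01"), "2005-07-01": Decimal("604.25"),
          "2005-07-05": Decimal("598.10")},
    "M": {"2005-06-30": Decimal("600.02"), "2005-07-01": Decimal("604.25"),
          "2005-07-05": Decimal("598.12")},
    "G": {"2005-06-30": Decimal("600.02"), "2005-07-01": Decimal("604.25"),
          "2005-07-05": Decimal("598.12")},
}

def pairwise_close_totals(closes):
    """Return {(a, b): sum of (a minus b) closes over shared dates} for every vendor pair."""
    totals = {}
    for a, b in combinations(sorted(closes), 2):
        shared = closes[a].keys() & closes[b].keys()
        totals[(a, b)] = sum((closes[a][d] - closes[b][d] for d in shared), Decimal("0"))
    return totals

def daily_range(high, low):
    """Range of a single bar: high minus low."""
    return high - low

def accumulated_error_dollars(error_per_day, trading_days, dollars_per_point):
    """Dollar value of a constant per-day price error accumulated over many days."""
    return error_per_day * trading_days * dollars_per_point

pair_totals = pairwise_close_totals(CLOSES)
for pair, total in pair_totals.items():
    # A total of zero means the two sources agree on every shared date
    # or that their differences happen to cancel out.
    print(pair, total)

# The article's arithmetic: 0.01 per day over 1,257 trading days at $100 per point.
article_total = accumulated_error_dollars(Decimal("0.01"), 1257, 100)
print(article_total)
```

A zero total for a pair is only suggestive, since offsetting errors also sum to zero; scanning the individual daily differences, as the author does for September 17, 2008, is what shows where a discrepancy actually sits.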
The more columns I added, the more difficult the spreadsheet became to read. So for those interested in the details found in these spreadsheets, visit www.MoneyMentor.com/Articles.html, where you can see enlargements of these figures.

FIGURE 7: RUSSELL 2000 EXCHANGE DATA VS. T VS. M VS. G. If you look at the data of the close carefully, you will note there are slight discrepancies.

When you compare the closes of the Russell index to the closes of vendors G, T, or M, there are slight discrepancies. I decided to add another data vendor, one who does not connect a brokerage firm to the data (as far as I know). Yahoo! makes its data available for free, and because it is such a popular data source, it should have pretty clean data.
As I looked closer at the data, I saw that some vendors were using the first opened trade for the open value, and some were using the opening range of the first few minutes. The same applied to the close, in that some use the value at the bell and some use the range as all the last few orders trickle in. In Figure 8, I overlaid two sets of data. You can see where the orange tick is at a different location than the green tick. Orange stands for the Russell 2000 data from the exchange, while the underlying blue and green are from vendor T.

[Chart: $RUT, daily Russell 2000 index, September through October, annotated "Slight, though evident discrepancies"]
FIGURE 8: TWO DATA VENDORS ON ONE CHART. Here you see two sets of data overlaid on top of each other. The orange tick is the data from the exchange and the blue and green ticks are from T. Some bars such as the first, third, and fifth do have a discrepancy.

Looking at the chart provided a clearer picture. You can see right from the first bar on this chart that the open tick (to the left) has an orange one and a green one, only slightly different, but different nonetheless.
On the third bar from the left there is some difference between the two opens, though the closes are equal. Similarly on the fifth bar, you can visualize the discrepancy across the chart. The differences are subtle, but they are there.
As for my own trading, I entered and exited on market orders. Or I let a stop take me out. In neither case was it crucial that I placed orders on the open of the bar, or even on the close of the bar.
However (and this is a big however), when writing and testing system ideas, many, if not most, coders specify things like:

    IF condition1 THEN BUY next bar on the OPEN;
    IF condition2 THEN SELL on CLOSE;

If I tested such code against these five datasets I would get different results, different profits and losses, depending on the data vendor or software vendor. So which results are correct? The correct data is the set that gives the same results as actual trades entered into the market would yield. And that brings us to the heart of the matter. Do you want to put your money into the markets in a reversal system like, say, the moving average convergence/divergence (MACD), just to see whether the trades it comes up with replicate the trades the hypothetical system generates? Of course not, and neither do I. So we are at an impasse.
I'll come back to the impasse in a minute, but for now let's get back to the data comparisons. From Figure 6 you can see that vendors M and G are very close in the data they provide. Most of the cells in the spreadsheet contain zeroes; there is no difference in the data point from vendor G and the data point from M. You can see a spreadsheet of the differences between all vendors on my website.

Putting it all together
It is clear that there are many differences between data vendors. The close is not the close all the world around, and there is no way to evaluate which is better. The better data is the data that most closely approximates what you would experience in actual trading scenarios. The problem is, I don't know how to run that experiment. I could set up an automated system in each software, where it would enter each trade in the markets on its own. Then, after letting the systems run for a year or so, we could compare the results of each trading experiment to evaluate the accuracy of the underlying data. Other than that, it is a matter of personal experience.

System testing
The data among these five sources varies, sometimes widely. What if the data is different? It matters when you are entering trades in the markets, especially if you're trading at a very fast pace. It doesn't matter so much if you are off a penny in a trade that lasts for a year, or a month, or even a week. But if you are scalping for pennies, then the data you are making your decisions with needs to be exactly the same as the actual trades happening in the real market.
Running experiments down to the pennies is not within the scope of this article. I have limited the scope of these tests to daily charts over the past five years of data. This will illustrate the differences between the data sources when applied to hypothetical trading.
I will run the same experiment on all five sets of data. For vendors T, G, and M, the data is supplied by the software vendor, so they go hand in hand. For vendor R, there is no software associated with the exchange, so I am going to import the data from a CSV text file into T's software and run the tests from there. The same applies to the data from Y: the data will be imported from a CSV file into T's software and tested from there. I will then view the results of the tests by looking at the performance reports correlated to each dataset.
In order to set up the experiment, it is necessary to hold constant as many variables as possible, so that you compare apples to apples and get meaningful results. Here are the constraints I employed:

• Trade only one contract
• Constrain the data to the time frame October 13, 2005, to October 13, 2010
• Do not allow pyramiding
• Limit the input values to 12, 26, 9
• Enter at the market, not on the open or close of the signal bar

With these values in mind, I ran one test and compared the results. At the beginning, I got wildly different answers. Because of the different philosophies of the software vendors, it was challenging to find the locations of the settings for things like trading one contract versus trading 100 contracts at a time. But with diligence, I got them all set up identically.
If you were to look at the charts of the data from each vendor, from a visual perspective the results look similar. To inspect the data more closely, I broke it down into a tabular format. I am not going to display all of the statistics, only the most important to the analysis (Figure 9).

Stats                     Vendor G    Vendor M    Vendor R    Vendor T    Vendor Y
Total net profit          <$221.00>   <$302.60>   <$6.17>     <$272.94>   <$275.03>
Profit factor             0.82        0.76        0.99        0.78        0.78
# Trades                  102         99          55          100         100
% Profitable              30.4%       29.29%      34.55%      30.00%      30.00%
Avg trade NP              <$2.00>     <$3.06>     <$0.11>     <$2.73>     <$2.75>
Ratio avg win: Avg loss   1.89        1.83        1.88        1.83        1.82
Avg bars in trades        11.98       12.6        14.67       12.72       12.72
Account size req'd        $511.00     $508.80     $258.81     $507.80     $506.64
CPC Index                 0.46        0.41        0.62        0.43        0.43

FIGURE 9: PERFORMANCE REPORTS FROM ALL FIVE VENDORS. Statistically, there is little difference among the vendors as far as performance data goes. However, note that R (the exchange) has the best performance overall.

Statistically, there is little difference among the four vendors, as far as performance data goes. What stands out, however, is the difference between the collections of the vendors against the data from the exchange itself. Vendor R, the exchange, has the best performance overall and is the one dataset different from the others.
I'm not going to get into all the results of the experiments. For the purposes of this article, I am not trying to find whether the MACD system works, but rather attempting to uncover discrepancies among the data available for analyzing and trading. Admittedly, this is one set of data on the Russell 2000, and one set of parameters for only the MACD reversal system. It is not a comprehensive test, and not by any means a full analysis. But it is useful for answering the question posed by one vendor when touting the accuracy of their data. The outcome is not dramatic. All the data vendors present a losing outcome for the standard MACD strategy.
The only one close to positive is the data from the exchange; the rest are all negative by roughly the same amount. It would be interesting to run comprehensive tests over a variety of time frames, optimizing the parameters and using other types of orders besides just buying at the market. But it would be an extensive test with thousands of outcomes.

The bottom line
The trades entered into real-time markets will vary far more than the data for the trades in this simple experiment. Over approximately 100 trades, there is only a few dollars' variation. In real-life trading, your fills will have wider ranges as markets move faster or slower, and the range between bids and asks widens and narrows. In real-life trading, by trading just one share, you can easily lose double the amounts shown in just a few minutes, depending on how quickly your order is entered in a fast-moving market.
Backtesting is not meant to provide precise replication of what would happen in the real markets. It is meant to give you an overall impression of whether your concepts are viable. The markets never again do the exact same thing they did before. They may echo similar patterns, but they don't duplicate them precisely. You can do all the testing in the world, but when it comes down to entering real trades, the markets will hand you something unexpected.
Larry Williams said something that has always stuck with me: It takes time to make profits. Making a few dollars at a time to make $100 a day is against my trading philosophy. I believe in mathematical analysis of patterns and detection of setups that are likely to predict important pivot points and turns in the markets.
In the end, all five of these data sources are perfectly adequate for testing and for trading. It's all relative.

A trader, author, computer programmer, and mathematician, Sunny Harris has been trading since 1981. The first printing of her first book, Trading 101: How To Trade Like A Pro, sold out in two weeks, and continues to be a financial best-seller, and her second book, Trading 102: Getting Down To Business, also achieved record sales. In early 2000, Harris released Electronic Day Trading 101, followed by Getting Started In Trading in 2001. She may be contacted at MoneyMentor.com.

Related reading
www.MoneyMentor.com/Articles.html
Microsoft Excel
TradeStation
See Editorial Resource Index
