Data We Have vs Data We Need
Analysts sometimes omit an important step in their research — asking whether the data available are right for the research task proposed.
For our purposes, research may be roughly divided into three types:
- Historical, where we study, explain, and interpret the past;
- Contemporary, where we observe and measure current phenomena; and
- Predictive, where we attempt to forecast the future.
The accuracy of research results declines as we move from studying the past to predicting the future.
This may be clearer if we leave the realm of financial markets, where many seem to have already formed conclusions, and turn to a different problem: forecasting the World Series. (Yes, I know that the results are in, but it was a good example when I wrote this, flying over Montana some time ago.)
How do we predict who will win the World Series? Studying baseball is like paradise for the quantitative researcher. There is a wealth of historical information, and no shortage of theories about what leads to success on the part of the winning teams. There is also good statistical information about the current performances of the contending teams.
Data analysis is important. In Moneyball, Michael Lewis showed the relevance of quantitative analysis to building and managing a good team. Many investment professionals took note of the similarities between the process of finding undervalued baseball players and finding undervalued stocks.
Despite the wealth of data and models, the outcomes may still surprise. First, we had the early elimination of the New York Yankees, the strong favorite of expert forecasters. In the Series we had the defeat of the Tigers, strong in both the regular season and the playoffs, by a team with a modest overall record. What is the problem?
We have good data on the performances of Ty Cobb, Tom Seaver, Mickey Mantle, Ozzie Smith, and other former players for the teams in the baseball playoffs, but that knowledge is not much help in predicting this year’s result. We also know how Derek Jeter, Albert Pujols, and Magglio Ordonez played during the regular season, but is it wise to expect that performance to continue into the playoffs?
Returning to the financial world, much of the data actually changes, growing more accurate with age. In the case of payroll employment data, for example, the initial report is based upon only 65% of the establishments surveyed. (What a surprise! Not all businesses file a government report on time.) Even after two revisions, the report is not final. When the results are eventually benchmarked against actual data, the Bureau of Labor Statistics (BLS) finally learns about businesses left out of the original sample and corrects the results.
Then the record books are rewritten. This is extremely important, as we shall see in Part II of this series.
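To make the revision effect concrete, here is a minimal Python sketch of how an estimate can drift as late reports arrive. Everything in it is an assumption for illustration: the `Establishment` class, the `estimate_total_change` and `revision_path` helpers, and the 20-establishment sample are made up, and the simple scale-up rule is not the estimator the BLS actually uses. The only figure borrowed from the discussion above is the 65% on-time response share. The later benchmark correction is not modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Establishment:
    job_change: int    # hypothetical monthly change in jobs at this establishment
    report_round: int  # 0 = reported on time, 1 = by first revision, 2 = by second


def estimate_total_change(sample: Sequence[Establishment], as_of_round: int) -> float:
    """Scale the mean change among reporters so far up to the whole sample.

    This is a deliberately naive rule chosen for illustration only.
    """
    reported = [e.job_change for e in sample if e.report_round <= as_of_round]
    if not reported:
        raise ValueError("no reports received yet")
    return sum(reported) / len(reported) * len(sample)


def revision_path(sample: Sequence[Establishment]) -> List[float]:
    """Return the initial estimate followed by the two revised estimates."""
    return [estimate_total_change(sample, r) for r in range(3)]


# Entirely made-up sample: 13 of 20 establishments (65%) report on time,
# and the late reporters happen to have weaker job growth.
SAMPLE = (
    [Establishment(job_change=10, report_round=0)] * 13
    + [Establishment(job_change=4, report_round=1)] * 4
    + [Establishment(job_change=-2, report_round=2)] * 3
)

if __name__ == "__main__":
    for round_number, estimate in enumerate(revision_path(SAMPLE)):
        print(f"estimate after round {round_number}: {estimate:.1f}")
```

With these invented numbers the estimate starts at 200, falls to roughly 172 after the first revision, and ends at 140 once everyone has reported. The size and direction of the drift depend entirely on how the late reporters differ from the early ones, which is exactly what we cannot know when the first report is published.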