A number of my friends have told me that the sexiest job in the world would soon be that of a statistician.

I am not a statistician - I am a researcher who happens to believe that statistics is a tool for extracting information and uncovering insights from phenomena in the world. I also happen to be interested in all things "numeratic": linear algebra, calculus, ordinary differential equations, operations research, queuing theory....

I have used statistics since day one in my first job as a media strategist at Basic Ad, working on accounts such as McDonald's, which demanded that their media plans be as accurate as possible (i.e., gotta use linear regression to predict the ratings with certainty) and which encouraged us to assess the relationship between their sales and advertising promotions.

Since then, I have not looked back.

A few years ago, I got introduced to the theory of econometrics - how statistical principles are used to explain economic phenomena, either in the macro- or microeconomic setting. Then there was the growing interest in the marketing field towards using econometrics to justify media budgets and laydowns, as well as channel selections and all that.

It did not take long to realize that anybody with the software and the patience to push numbers through matrices and spreadsheets can build a model. Heck, I even fell into the same trap - creating models that didn't make any sense.

That is when I realized what econometrics - and perhaps the statistical modeling on which econometrics is based - is all about:

### Creating, formulating, and coming up with equations relating two, three, or more variables can be done by almost anyone who is interested.

Anybody who can grasp the saying (questionable as it may be) that "the past could be a good indicator of the future" and has the mathematical interest and bandwidth to create models, can.

Take this case for example:

I am fascinated by the movements of the stock markets. (And I am no expert in finance or equity trading.) In particular, I was interested in the movement of the Singapore market.

I hypothesized - urged on by the current global economic crisis - that stock markets must be related to one another in one way or another: that the Straits Times Index (STI) of Singapore should be affected by movements in the NASDAQ, the S&P500, the FTSE, the VIX, and others.

My hypothesis was that investors in the STI take an interest in the overnight performance of these indicators: the NASDAQ and the S&P500 as indicators for the US, the FTSE for the European markets, and the VIX as an indicator that somehow reflects the "risk attitudes/risk expectations" of investors.

I therefore hypothesized that when the NASDAQ, the S&P500, and the FTSE go up, the STI would also go up. But since investors are risk-averse, I also thought that the higher the VIX (and therefore the higher the expectations of risk and volatility ahead), the more negatively it should impact the STI.

With this small amount of information and these hypotheses, I went ahead and crafted my models.

(Hold your criticisms about multicollinearity and autocorrelation... For now, I am loosening the assumptions for my simple experiment.)

So... after several attempts, I got the following first model:

The model used the abovementioned weekly variables - FTSE, NASDAQ, S&P500, VIX - as predictors and the STI as the dependent variable. (I also created an index - a NASDAQ/S&P500 ratio - since the NASDAQ is more tech-oriented and Singapore is more closely tied to tech-oriented industries.)

Using data from 2000 onwards, the model (in red) seemed to track the data fairly well. It had several underpredictions at first, which I attributed to the post-tech-bubble uncertainties of 2001-2002. The goodness-of-fit indicators were acceptable: an adjusted R-squared of 0.721 - not bad! The residuals also looked fairly good.

(Again, hold off discussions of autocorrelations and multicollinearity.)
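For readers who want to replicate the flavor of this first model outside Excel/XLSTAT, here is a minimal sketch in Python on purely synthetic data (the actual Yahoo! Finance series are not reproduced here, and the "true" coefficients below are made up for illustration): an OLS fit of the STI on the four predictor indices, with the adjusted R-squared computed from the residuals.

```python
import numpy as np

# Minimal sketch of the first model on SYNTHETIC data (the real series
# came from Yahoo! Finance and were modeled in Excel/XLSTAT): OLS of the
# STI on FTSE, NASDAQ, S&P500, and VIX levels.
rng = np.random.default_rng(0)
n = 200  # pretend weekly observations

ftse = rng.normal(5000, 500, n)
nasdaq = rng.normal(2000, 300, n)
sp500 = rng.normal(1200, 150, n)
vix = rng.normal(20, 5, n)

# A made-up "true" relationship plus noise, purely for illustration
sti = 0.3 * ftse + 0.5 * nasdaq + 0.4 * sp500 - 10 * vix + rng.normal(0, 50, n)

X = np.column_stack([np.ones(n), ftse, nasdaq, sp500, vix])  # intercept first
beta, *_ = np.linalg.lstsq(X, sti, rcond=None)
resid = sti - X @ beta

# Goodness of fit: R-squared and adjusted R-squared
ss_res = resid @ resid
ss_tot = ((sti - sti.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1])
print(round(adj_r2, 3))
```

Any spreadsheet regression add-in does essentially the same computation; the point is only that the fit itself is mechanical.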

I figured, "Hmm, there must be a better way - could there be a better way?" Could I increase my R-squared and the other goodness-of-fit indicators?

So after several (perhaps 25-30) iterations, I achieved the following:

Not too shabby, I said to myself - indeed, the goodness-of-fit indicators increased. Among them: an R-squared of 0.976 for the movement of the STI.

(To the technically savvy: this was arrived at using the same variables, but this time using weighted least squares.)

Obviously, I was giddy - I may just have modeled the movements of the STI over the past 9 years with simple mathematics and good old Microsoft Excel (and an add-in called XLSTAT). I had derived the respective weights and parameters of each of the variables - and established their significance at an alpha level of 0.01! (In some cases, the p-values were actually well below 0.01!)

Then the realization crept in:

### But modeling - econometrics, statistical modeling - is about more than creating models and equations. It is about explaining why things moved and behaved the way they did in the past in order to see the future more clearly.

And that was when I got into trouble.

So... the above are the most significant positive predictors. "FTSE x SP500" taken together explains a significant amount of the volatility - not quite my hypothesis, but close enough. But surprisingly, VIX also had a *significant positive effect*. Not only that: if we multiplied the NASDAQ by the VIX, there was an almost equally strong effect on the STI. Squaring the NASDAQ and dividing it by the S&P500 also turned out to explain changes in the STI - at an alpha level below 0.0001!

It became interesting when I started looking at what I called the "STI dampeners":

The S&P500 and the FTSE on their own were dampeners - they could shave value off the STI. Multiplying the VIX by the S&P500 also indicated a significant decline. The most perplexing was the last variable: VIX (risk perceptions/expectations) x NASDAQ, cumulatively divided by the S&P500, was the most significant dampener!
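Mechanically, interaction terms like these are just derived columns added to the design matrix before the regression is run. A minimal sketch, using placeholder values rather than the actual index data (the "cumulatively divided" variable is omitted since its exact construction is not spelled out):

```python
import numpy as np

# Sketch of building the interaction regressors named above as derived
# columns. Values are synthetic placeholders, not the actual index data.
rng = np.random.default_rng(1)
n = 100
ftse = rng.normal(5000, 500, n)
nasdaq = rng.normal(2000, 300, n)
sp500 = rng.normal(1200, 150, n)
vix = rng.normal(20, 5, n)

X = np.column_stack([
    np.ones(n),          # intercept
    ftse * sp500,        # "FTSE x SP500" (positive predictor)
    nasdaq * vix,        # NASDAQ x VIX (positive predictor)
    nasdaq**2 / sp500,   # NASDAQ squared over S&P500
    vix * sp500,         # VIX x S&P500 (dampener)
])
print(X.shape)
```

The ease of generating such columns is exactly the trap: the software will happily fit coefficients to any of them, whether or not they mean anything.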

### And again, modeling - econometrics, statistical modeling - is about more than creating models and equations. It is about explaining why things moved and behaved the way they did in the past in order to see the future more clearly.

And that was when I got into trouble: **why?**

Why would these relationships exist? Why would the FTSE and the S&P500 dampen the STI? Don't we take cues from these two markets? The NASDAQ on its own was not even a significant predictor, with a t-statistic well below the acceptable range.

One could argue that it reflects the complexity of investors' thinking - but still, why would the VIX on its own not be a dampener? Only when the VIX was multiplied by the S&P500 could I accept that the VIX dampens the STI!

### Econometrics is about the why's - not just the numbers and formulae.

I am no finance expert - I have a very basic understanding of the equity markets. I have some knowledge of finance theory. But I just cannot explain these numbers - and the formulae that I have derived.

If you asked me what the numbers and the formulae meant, I wouldn't be able to explain. I could only say for certain that the model suggested by the equations tracked the real data well. I could recite - in a non-monotonous voice (because I try not to be too boring when I present data) - the effect of each variable in numbers. I could perhaps even come up with an interesting visualization of the impacts of the factors and their interactions.

But ask me why the VIX is positive on its own but not when it is multiplied by another variable - and I'd be lost.

Because econometric - and statistical - modeling is all about uncovering the stories behind the numbers and explaining them.

Sure, now that I have the formulae for predicting the future of the STI given varying levels of the FTSE, NASDAQ, S&P500, and VIX indices (predicting those inputs is another story), software can be built, risk tables for varying levels of the predictor indices can be programmed, and decisions can be made. One could even build 'self-correcting', dynamically adapting models on top of these.

But as to why these formulae hold, I have no clue.

**And that is where I would fail against the tenets of econometrics and statistical modeling: The explanatory power, the stories, the why's and the how's and the how-come's.**

That is where people's brains - and dare I say, people like me - come in. Because after all the formulas have been established and tested, after all the parameters have been derived, and their individual and aggregate/combined effects on the predicted variable have been uncovered, the question then becomes **"Why?"**

The bottom-line:

- Whether we like it or not, formulating theory and hypotheses before the modeling even starts is critical.
- Whether we like it or not, computers - for now - cannot take the place of crafting explanations and stories. Maybe there will be a time when stories can be crafted by computers. But for now, brains still matter.
- We need people with technical and theoretical statistical and econometric backgrounds - but in the fields of finance, consumer behavior, and marketing, we also need experts who can ground these formulae in the real business world. In answers to real business questions.

To paraphrase Joyce Kilmer:

Models are made by fools like you and me

But only truly integrative minds can ground them in reality.

---------------------------------------------------------------

PS. I got the data for all the indices, from January 2000 to August 2009, from Yahoo! Finance (http://finance.yahoo.com), which I think is one of the friendlier consumer-oriented finance sites on the web.

For the first model, I used OLS; for the second, weighted least squares with the most recent observations weighted 10x more than the older data.
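A minimal sketch of that weighting scheme on synthetic data: multiplying each row of the regression by the square root of its weight turns WLS into OLS on the scaled data. (Exactly which observations counted as "the last" is not specified above, so this sketch arbitrarily up-weights the most recent quarter of the sample.)

```python
import numpy as np

# Sketch of weighted least squares (WLS) on synthetic data: scaling each
# row by the square root of its weight reduces WLS to OLS on scaled data.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)  # made-up true slope of 2

w = np.ones(n)
w[-n // 4:] = 10.0  # assumed: most recent quarter weighted 10x more

X = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(np.round(beta_wls, 2))
```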

The residuals for the first model, by the way, are slightly heteroscedastic; the second model fared better. Both models, however, fail the Durbin-Watson test, which means the residuals are autocorrelated at order 1. They also fail the more general Breusch-Godfrey test, which means there is higher-order autocorrelation as well.
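The Durbin-Watson statistic itself is simple to compute from a residual series: a value near 2 suggests no first-order autocorrelation, while values well below 2 indicate the positive autocorrelation both models exhibit. A sketch on synthetic residuals:

```python
import numpy as np

# Sketch of the Durbin-Watson statistic on synthetic residuals.
# DW near 2: no first-order autocorrelation. DW well below 2: positive
# first-order autocorrelation (the failure both models show).
def durbin_watson(resid):
    diff = np.diff(resid)
    return (diff @ diff) / (resid @ resid)

rng = np.random.default_rng(3)
white = rng.normal(0.0, 1.0, 500)  # uncorrelated residuals
ar1 = np.empty(500)                # AR(1) residuals with rho = 0.8
ar1[0] = white[0]
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)  # expect a value near 2
dw_ar1 = durbin_watson(ar1)      # expect a value far below 2
print(round(dw_white, 2), round(dw_ar1, 2))
```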

My suggestion: use time-series methods if you are really interested. ARIMA would do you good - as would GARCH modeling.

It seems you are bumping up against the limits of correlation. There are limits to knowing in something as dynamic as the global stock market, but would a qualitative approach, followed by an experimental approach, take you a step further in your quest for understanding? With semantic computing progressing, it seems that large-scale qualitative study is becoming possible.

P.S. I doubt more than a few could distinguish you from a statistician. I think your sex appeal is safely secured for the time being.

Posted by: Howard | 13 August 2009 at 11:21

Indeed, I am challenging the limits of correlation. And indeed, there are limits to learning about something as dynamic as the stock market. My point was more to say, "Hey, if I wanted to create an equation between two things, it can be done... and it may even prove to be statistically significant, logical, and obedient to the rules and assumptions [the above doesn't :)]".

I would agree with you that there is a need for a combined qualitative-quantitative approach to understanding data. We live in a data-rich world - data is simply available, and software that can crunch numbers is also cheaper and more widely available...

But that doesn't mean we are narrowing the gap between our models and what is really happening.

I think there is hope in semantic computing and statistical relational learning and methods. I think the social media revolution is not just about the connections between people - but the connections between information bits. And that's where we need to evolve to.

Thanks for the comments!

Posted by: Philip Tiongson | 15 August 2009 at 02:47