Tuesday, July 7, 2009

Statistical Data Analysis Explained

Applied Environmental Statistics with R
Clemens Reimann, Peter Filzmoser, Robert G. Garrett, Rudolf Dutter | PDF | 359 pgs | 17mb

Statistical data analysis is about studying data – graphically or via more formal methods. Exploratory Data Analysis (EDA) techniques (Tukey, 1977) provide many tools that transform large and cumbersome data tabulations into easy-to-grasp graphical displays which are largely independent of assumptions about the data. They are used to “visualise” the data. Graphical data analysis is often criticised as non-scientific because of its apparent ease. This critique probably stems from many scientists trained in formal statistics not being aware of the power of graphical data analysis.

Occasionally, even in graphical data analysis, mathematical data transformations are useful to improve the visibility of certain parts of the data. A logarithmic transformation is a typical example: it is used to reduce the influence of unusually high values that are far removed from the main body of data.
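
The effect of a logarithmic transformation can be illustrated with a minimal sketch. The concentration values below are made up for illustration, and the helper `median_position` is a hypothetical name, not something from the book. It reports where the median sits along the axis between the minimum and the maximum, as a rough indication of how much of a plot the main body of data would occupy.

```python
import math
import statistics

# Made-up concentration values: a tight main body plus one unusually high value.
values = [1.2, 1.5, 1.8, 2.1, 2.4, 3.0, 3.3, 4.1, 250.0]


def median_position(data):
    """Return the position of the median between min (0.0) and max (1.0)."""
    low, high = min(data), max(data)
    return (statistics.median(data) - low) / (high - low)


# Base-10 logarithm of every value (all values must be positive).
logged = [math.log10(v) for v in values]

print(f"raw scale: median at {median_position(values):.1%} of the axis")
print(f"log scale: median at {median_position(logged):.1%} of the axis")
```

On the raw scale the single high value stretches the axis so far that the main body of data is squeezed into a tiny fraction of it. After the transformation the main body takes up a noticeably larger share of the axis, which is what makes it easier to see in a graphic.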

Graphical data analysis is a creative process; it is far from simple to produce informative graphics. Among other things, the choice of graphic, symbols, and data subsets is crucial for gaining an understanding of the data. It is about iterative learning, moving from one graphic to the next until an informative presentation is found, or as Tukey (1977) said, “It is important to understand what you can do before you learn to measure how well you seem to have done it”.

However, for a number of purposes graphics are not sufficient to describe a given data set. Here the realm of descriptive statistics is entered. Descriptive statistics are based on model assumptions about the data and are thus more restrictive than EDA. A typical model assumption used in descriptive statistics would be that the data follow a normal distribution. The normal distribution is characterised by a typical bell shape (see Figure 4.1 upper left) and depends on two parameters, mean and variance (Gauss, 1809). Many natural phenomena are described by a normal distribution, so this distribution is often used as the basic assumption for statistical methods and estimators. Statisticians commonly assume that the data under investigation are a random selection of many more possible observations that altogether follow a normal distribution.

Many formulae for statistical calculations, e.g., for mean, standard deviation and correlation, are based on such a model. It is always possible to use the empirical data at hand and the given statistical formula to calculate “values”, but only if the data follow the model will the values be representative, even if another random sample is taken. If the distribution of the samples deviates from the shape of the model distribution, e.g., the bell shape of the normal distribution, statisticians will often try to use transformations that force the data to approach a normal distribution. For environmental data a simple log-transformation of the data will often suffice to approach a normal distribution. In such a case it is said that the data come from a lognormal distribution.
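
As a rough illustration of the lognormal case, the sketch below simulates data rather than using any real environmental measurements. The sample size, seed, and distribution parameters are arbitrary assumptions, and `skewness` is a hand-written helper for the ordinary moment-based skewness, which is close to zero for a symmetric, bell-shaped sample.

```python
import math
import random
import statistics


def skewness(data):
    """Moment-based sample skewness: mean of cubed standardised values."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)


# Simulated sample (assumption: lognormal with mu=0, sigma=1, fixed seed).
rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Natural log of each value; for lognormal data this should look normal.
logged = [math.log(x) for x in raw]

print("raw:    mean", round(statistics.fmean(raw), 2),
      "median", round(statistics.median(raw), 2),
      "skewness", round(skewness(raw), 2))
print("logged: mean", round(statistics.fmean(logged), 2),
      "median", round(statistics.median(logged), 2),
      "skewness", round(skewness(logged), 2))
```

On the raw scale the mean is pulled above the median by the long right tail and the skewness is strongly positive, so a mean and standard deviation computed there would not describe the data well. After the log-transformation the mean and median nearly coincide and the skewness is close to zero, which is consistent with an approximately normal distribution.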

