[time-nuts] Discarding outliers in two dimensions

Thu Dec 10 10:00:04 UTC 2009

Hal,

> For one dimension, you sort, compute the average, then compute the distance
> of the first and last samples from the average.  Discard the one that is
> farther from the average

....Well, that may work... A method of outlier search in one dimension that
has worked very well for me over the last few years is the following (it is
not my idea but comes from an article covering robust statistics):

First you have to understand that the usual arithmetic average and the
standard deviation are measures that are NOT robust against outliers, and
that you need to replace them with robust measures when you need their
functionality.

1) Sort the data in ascending order.

2) Find the "center" of the sorted data, i.e. the data value for which 50 %
of all values are greater than or equal to it and the other 50 % are smaller
than or equal to it. This value is called the "median" or "50th percentile".
Think of it as a VERY robust substitute for the average.

3) Now (similar to the standard deviation) compute the absolute values of
the differences between each data point and the median.

4) Again sort the resulting values in ascending order and find their
median.

5) What you have now is the median deviation of the data from the original
median, which is a very robust measure of the width of the distribution.
There is even a "norming" factor (that I do not remember because I do not
need it) that makes this number directly comparable to the standard deviation
of (outlier-free) data. About 99.7 % of all data of a Gaussian distribution
lie inside +/- 3 sigma, so if a data value is outside, say, +/- 5 median
deviations then it is very likely an outlier.
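
Steps 1) to 5) can be sketched in a few lines of Python. This is only a
minimal illustration, not code from the article mentioned above: the function
name mad_outliers, the sample values and the default threshold of 5 median
deviations are assumptions made for the example. The "norming" factor for
Gaussian data is usually quoted as about 1.4826; it is not applied here.

```python
from statistics import median

def mad_outliers(data, threshold=5.0):
    """Return (median, median deviation, outliers) for a list of numbers.

    threshold is the number of median deviations beyond which a value is
    treated as an outlier (5 here, a hypothetical default for this sketch).
    """
    center = median(data)                          # steps 1-2: sort, take the median
    deviations = [abs(x - center) for x in data]   # step 3: absolute differences
    mad = median(deviations)                       # steps 4-5: median of the deviations
    # Note: if more than half the values are identical, mad is 0 and every
    # value that differs from the median is flagged.
    outliers = [x for x in data if abs(x - center) > threshold * mad]
    return center, mad, outliers

# Made-up samples: six well-behaved values and one obvious outlier.
samples = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]
center, mad, outliers = mad_outliers(samples)
print(center, mad, outliers)
```

With these samples only the value 25.0 lies more than 5 median deviations
from the median, so it is the only one reported.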

However, what you really want is an outlier-free average value. The median
itself is a single data value containing all the noise that you want to
average out. For this purpose robust statistics holds a different (but
similar) tool based on the IQR (interquartile range), usually called the
interquartile mean. The algorithm is:

1) Sort the data in ascending order.

2) Find the median of the data (the 50th percentile), and in addition find
the 25th percentile and the 75th percentile.

3) Now you have 4 groups (quartiles) of data, divided by the 3 percentiles.
Ignore the outer quartiles (where outliers are located) and compute the
arithmetic average over the two inner quartiles, which are free of outliers
as long as fewer than 25 % of all data are outliers at either end.

This approach is a robust compromise between outlier removal and noise
removal.
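
As a rough sketch, the averaging over the two inner quartiles could look like
this in Python. The function name interquartile_mean and the sample values
are made up for the example, and the quartile boundaries are simply taken as
len(data) // 4 samples from each end, which is one of several possible
conventions when the number of samples is not a multiple of 4.

```python
def interquartile_mean(data):
    """Average of the middle half of the data (the two inner quartiles)."""
    if not data:
        raise ValueError("need at least one sample")
    ordered = sorted(data)                     # step 1: sort ascending
    cut = len(ordered) // 4                    # step 2: size of each outer quartile
    inner = ordered[cut:len(ordered) - cut]    # step 3: drop both outer quartiles
    return sum(inner) / len(inner)

# Made-up samples: seven well-behaved values and one obvious outlier.
samples = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1]
print(interquartile_mean(samples))
```

For these eight samples the two smallest and the two largest values
(including 25.0) are dropped, and the average of the remaining four comes out
near 10.05 instead of the plain average of about 11.9.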

For the two-dimensional case I would suggest the following:

1) For all computations keep the index of each data point alongside its
values, so that a data point can be identified later.

2) Sort the data in ascending order, separately for each of the two
dimensions.

3) Identify the inner quartiles, separately for each of the two dimensions.

4) Now search for indices that are contained in BOTH sets of inner
quartiles, i.e. data that has NOT been sorted out as an outlier in either of
the dimensions.

5) Compute the arithmetic average over the data points found in step 4).
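
A minimal Python sketch of this two-dimensional suggestion follows. The names
inner_indices and robust_mean_2d and the sample points are assumptions made
for illustration only. As a simplification, the inner quartiles are taken by
dropping len(values) // 4 samples from each end, and ties are broken by the
original sample order.

```python
def inner_indices(values):
    """Indices of the samples that fall into the two inner quartiles."""
    # Steps 1-2: sort the indices by value so every point stays identifiable.
    order = sorted(range(len(values)), key=lambda i: values[i])
    cut = len(values) // 4
    # Step 3: keep only the middle half of the sorted indices.
    return set(order[cut:len(values) - cut])

def robust_mean_2d(points):
    """Average the points that are inside the inner quartiles in x AND y."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    keep = inner_indices(xs) & inner_indices(ys)   # step 4: in BOTH sets
    if not keep:
        raise ValueError("no sample lies in both inner quartile ranges")
    n = len(keep)
    # Step 5: arithmetic average over the surviving points.
    return (sum(xs[i] for i in keep) / n, sum(ys[i] for i in keep) / n)

# Made-up positions: one outlier in x (9.0) and one outlier in y (-7.0).
points = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.0, 2.0),
          (1.2, 2.2), (0.8, 1.8), (9.0, 2.0), (1.0, -7.0)]
print(robust_mean_2d(points))
```

Note that the intersection can become small (here only two of the eight
points survive), so this trades quite a lot of averaging for robustness.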

Best regards
Ulrich

> -----Original Message-----
> From: time-nuts-bounces at febo.com 
> [mailto:time-nuts-bounces at febo.com] On Behalf Of Hal Murray
> Sent: Wednesday, 9 December 2009 11:53
> To: time-nuts at febo.com
> Subject: [time-nuts] Discarding outliers in two dimensions
> 
> 
> 
> Suppose I want to average a bunch of samples.  Sometimes it helps to discard
> the outliers.  I think that helps when there are two noise mechanisms, say
> the typical Gaussian plus sometimes some other noise added on.  If the other
> noise is rare but large, those occasional samples can have a big influence on
> the average.  So discarding those outliers gives better results, for some
> value of "better".
> 
> I know how to do it in one dimension.  How do I do it in two dimensions?
> 
> Say I have a lot of samples from a GPS system and I want to compute the best
> position to use when shifting into timing mode.
> 
> 
> For one dimension, you sort, compute the average, then compute the distance
> of the first and last samples from the average.  Discard the one that is
> farther from the average.
> 
> The problem with two dimensions is I don't know how to sort.
> 
> Let's ignore efficiency.  I can compute the average without sorting.  I can
> scan the whole list looking for the one that is farthest (radial distance)
> from the average.  Does that work (and do what I want)?  (I think so, but I'm
> not sure.)
> 
> Is there a way to do that efficiently?
> 
> 
> -- 
> These are my opinions, not necessarily my employer's.  I hate spam.