Statistics help required > Math And Science

Posted: 2/12/2015 8:23:50 PM EDT

Is there a statistical test to determine if a value is an outlier for a nonparametric data set?

I don't think I can use the Wilcoxon signed-rank test or the Kruskal-Wallis test, since those compare two (or more) samples rather than a single point against a population.

Posted: 2/12/2015 9:07:11 PM EDT

[#1]

In my graduate work in stats we had just conquered analysis of covariance and were venturing into the realm of multivariate analysis.... Sorry man...

Posted: 2/14/2015 1:18:15 AM EDT

[#2]

Check here: http://www.eng.tau.ac.il/~bengal/outlier.pdf

There's a pretty simple technique described here which relies on quartiles...so it'll work as long as your data aren't that complicated: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric2.html#whenthereareoutliers

Posted: 2/14/2015 2:04:01 PM EDT

[#3]

Do you know what distribution your data set follows?

Posted: 2/18/2015 3:53:54 PM EDT

[#4]

Whoops, lost track of this thread.

Data set is most often left skewed, but can vary to be bimodal or even semi-normal.

Anyways, apparently what people sometimes do is log-transform the data set to make it closer to normal, and then run a parametric test such as ANOVA.

Or, as post #2 linked: use the IQR (interquartile range) to detect outliers.
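A minimal sketch of the IQR approach, assuming the common Tukey convention of flagging anything more than 1.5 x IQR beyond the quartiles. The function name `iqr_outliers`, the multiplier `k`, and the sample values are made up for illustration and are not from the thread or the linked pages.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule.

    k=1.5 is the conventional multiplier; it is an assumption here,
    not something specified in the thread.
    """
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Made-up example data with one suspicious value.
sample = [2, 3, 3, 4, 4, 5, 5, 6, 6, 30]
lower, upper, flagged = iqr_outliers(sample)
print(lower, upper, flagged)
```

Because the fences are built from quartiles rather than the mean and SD, a single extreme value barely moves them, which is why this rule is usable on skewed data.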

.. now back to debugging my R script....

ETA: for those interested: bootstrapping

Posted: 2/18/2015 9:50:36 PM EDT

[#5]

When we were analyzing microarray data, we dumped the ones that were outside +/- 2 SD... I think. It's been a year or two since I had to do QC.
Otherwise, I just plot the points and they tend to jump out and yell "me, me... I'm the weirdo!!"

Enjoy the R.......

Posted: 2/19/2015 2:50:08 PM EDT

[#6]

That's very naughty of you. Generally it is verboten to dump data points without a good reason. If the data are roughly normal, a +/- 2 SD cutoff will get rid of ~5% of your data points even though they are not actually outliers.
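The ~5% figure can be checked with a quick calculation, assuming the data are exactly normal (an assumption for illustration; real microarray data may well not be). The helper name `fraction_outside` is made up.

```python
import math

def fraction_outside(k):
    """Fraction of a normal distribution lying more than k SDs from the mean.

    Assumes an exactly normal distribution, which is an idealization.
    """
    # For a standard normal Z, P(|Z| > k) = 1 - erf(k / sqrt(2)).
    return 1.0 - math.erf(k / math.sqrt(2))

for k in (2, 3):
    print(f"+/- {k} SD cutoff drops about {fraction_outside(k):.2%} of normal data")
```

So on 1000 perfectly well-behaved normal points, a 2 SD rule would be expected to discard roughly 45 of them.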

Posted: 2/22/2015 10:33:17 PM EDT

[#7]

The data "dump" was done because we were dealing with microarray data... which was/is a bit of a P.I.T.A.
The method we used was a generally accepted protocol at the time.
Nowadays, the procedures have changed a bit and I wouldn't dump as much data.
I was told by a nationally renowned statistician when we first started doing the arrays that if we gave him a set of array data, he could tell if they were created on the same day, if we used the same samples, if different technicians did them, if it was under a full moon..... Basically, they're VERY tricky to replicate and sensitive to all sorts of stuff...

But yah, I know dumping data is generally not the best thing to do. I'll dump 2 or 3 points out of 1000, but more than that gives me the heebie-jeebies nowadays.

Posted: 3/11/2015 6:32:59 PM EDT

[#8]

Anyone here know about hierarchical clustering methods?

I think I've arrived at the right distance/dissimilarity matrix to use (Bray-Curtis), but I'm not sure what clustering method I should choose.

If someone can point me in the right direction for reference material, that'd be great.

Otherwise, I have a 2D matrix: the rows are individual bacterial species and the columns are different pools of antibodies. Each cell represents how much of a particular bacterial species an antibody clone enriches for. I expect most clones to be different, but I want to visualize that with a heatmap.

That's what I have so far, but I'm not satisfied with the clustering algorithm (so far I've used Ward and average linkage).
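For reference, a minimal sketch of how a Bray-Curtis dissimilarity matrix between antibody pools (columns) could be computed, assuming non-negative enrichment values. The pool names and numbers are hypothetical, not data from this thread.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both vectors are all zeros; dissimilarity is undefined")
    return numerator / denominator

def pairwise_matrix(profiles):
    """Square dissimilarity matrix for a list of equal-length profiles."""
    return [[bray_curtis(p, q) for q in profiles] for p in profiles]

# Hypothetical data: one list per antibody pool, one entry per bacterial species.
pools = {
    "pool_A": [10, 0, 5],
    "pool_B": [8, 2, 5],
    "pool_C": [0, 20, 1],
}
names = list(pools)
matrix = pairwise_matrix([pools[n] for n in names])
for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row])
```

The values run from 0 (identical profiles) to 1 (no species in common), and the resulting matrix is what gets handed to the clustering step.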

Posted: 4/8/2015 7:52:48 PM EDT

[#9]

Usually when I do clustering, it's on 2D or 3D presumably metric spaces.

If you've got a metric space, you can use clustering based on the Euclidean distances among the points. If you've got an ultrametric space, you would want to use hierarchical clustering.

If the similarity matrix is based on features (which it sounds like it is), then the space is likely ultrametric, but there are ways of checking this. In that case, at least in my field, we'd use traditional hierarchical clustering methods (R's hclust function, in the stats package), or an additive similarity tree (see Sattath & Tversky 1977).
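One way such a check could look, as a sketch: a distance matrix is ultrametric when d(x, z) <= max(d(x, y), d(y, z)) holds for every triple of points. The function name, the tolerance, and the example matrices below are made up for illustration.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-9):
    """Check d[i][k] <= max(d[i][j], d[j][k]) for every ordered triple.

    `d` is a square, symmetric distance matrix given as nested lists.
    `tol` absorbs floating-point noise; its value is an arbitrary choice.
    """
    for i, j, k in permutations(range(len(d)), 3):
        if d[i][k] > max(d[i][j], d[j][k]) + tol:
            return False
    return True

# Made-up examples: in the first, the two largest distances in the triple tie.
ultra = [[0, 1, 3],
         [1, 0, 3],
         [3, 3, 0]]
not_ultra = [[0, 1, 2],
             [1, 0, 3],
             [2, 3, 0]]
print(is_ultrametric(ultra), is_ultrametric(not_ultra))
```

Real data will almost never pass exactly, so in practice one would probably look at how often and how badly the inequality fails rather than at a strict yes/no.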

Posted: 4/9/2015 8:37:08 PM EDT

[#10]

Hmm, I seem to be missing something:

When I do Hierarchical clustering, it seems I still have to input a distance matrix.

For my data set, I use Bray-Curtis dissimilarity or UniFrac distances. From there, I ask R to give me a cluster tree based on one of several clustering methods: average, point, complete or Ward.

My trouble right now is that I don't know which of those methods is most biologically relevant. (I've determined that average/point doesn't make sense, and I'm stuck between complete and Ward.)
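To make concrete what the choice of method changes, here is a naive sketch of agglomerative clustering on a precomputed distance matrix with single, complete, and average linkage. It is an illustration only, not what R does internally, and the distance matrix is made up. Ward is left out because its usual formulation assumes Euclidean distances, which Bray-Curtis is not.

```python
def agglomerate(dist, linkage):
    """Naive agglomerative clustering on a square distance matrix.

    `linkage` is "single", "complete", or "average".
    Returns the merges in order as (cluster_a, cluster_b, height).
    """
    combine = {
        "single": min,                             # closest pair of members
        "complete": max,                           # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs
    }[linkage]
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = combine(ds)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Made-up distances between four items.
d = [[0, 1, 4, 5],
     [1, 0, 2.5, 6],
     [4, 2.5, 0, 3],
     [5, 6, 3, 0]]
for method in ("single", "complete", "average"):
    print(method, agglomerate(d, method))
```

On this toy matrix, single linkage chains items onto one growing cluster, while complete and average linkage pair items 2 and 3 first. The same distance matrix can therefore yield differently shaped trees, so the choice of method matters for the heatmap.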

Posted: 4/10/2015 12:04:53 AM EDT

[#11]

100% - you still generate a distance matrix from that space. The issue is what kind of space it is. If it's ultrametric (a stronger condition than just being metric), then hierarchical clustering makes sense. If it's not ultrametric, then you might try an additive-trees type of approach.

So this is outside of your field of study but see this article:

http://www.cs.technion.ac.il/~moran/COURSES/papers/SaTv77.pdf
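As a rough sketch of how one might test whether distances fit an additive tree: the standard characterization is the four-point condition, which says that for any four points, the two largest of the three pairwise sums d(x,y)+d(u,v), d(x,u)+d(y,v), d(x,v)+d(y,u) must be equal. The function name, tolerance, and example matrices are made up for illustration.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """True if every quadruple of points meets the four-point condition.

    `d` is a square, symmetric distance matrix given as nested lists.
    `tol` absorbs floating-point noise; its value is an arbitrary choice.
    """
    for x, y, u, v in combinations(range(len(d)), 4):
        sums = sorted([
            d[x][y] + d[u][v],
            d[x][u] + d[y][v],
            d[x][v] + d[y][u],
        ])
        # The two largest sums must tie for the distances to fit a tree.
        if sums[2] - sums[1] > tol:
            return False
    return True

# Made-up tree distances: leaves 0 and 1 hang off one internal node
# (branch lengths 1 and 2), leaves 2 and 3 off another (3 and 4),
# and the internal nodes are 5 apart.
tree_like = [[0, 3, 9, 10],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [10, 11, 7, 0]]
# Same matrix with one distance perturbed so it no longer fits a tree.
perturbed = [[0, 3, 9, 12],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [12, 11, 7, 0]]
print(satisfies_four_point(tree_like), satisfies_four_point(perturbed))
```

Note that `tree_like` is additive without being ultrametric (distances 3, 9, 10 among the first three points have no tie for the largest), which is the case where an additive tree is a better fit than a standard dendrogram.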

[ARCHIVED THREAD] - Statistics help required
