Posted: 2/12/2015 8:23:50 PM EDT
|
Is there a statistical test to determine whether a value is an outlier in a nonparametric data set? I don't think I can use the Wilcoxon signed-rank test or the Kruskal-Wallis test, since those compare populations against each other rather than a single point against a population. |
|
Check here: http://www.eng.tau.ac.il/~bengal/outlier.pdf
There's a pretty simple technique described here which relies on quartiles...so it'll work as long as your data aren't that complicated: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric2.html#whenthereareoutliers |
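For anyone who wants to try the quartile approach without reading the whole page, here is a minimal sketch in Python. It assumes the usual 1.5 x IQR rule (Tukey's fences), which I believe is what the linked page describes. The function name `iqr_outliers` and the example numbers are made up for illustration, and quartile conventions differ slightly between tools, so R's `quantile()` may give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Quartile conventions vary between tools; "inclusive" is one common choice.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Anything outside the fences is flagged, not automatically discarded.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Made-up example data, not from the thread.
data = [2, 3, 3, 4, 4, 5, 5, 6, 30]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Because it only uses quartiles, the rule doesn't assume normality, though on heavily skewed or bimodal data it can still flag points that are perfectly legitimate.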
|
Whoops, lost track of this thread. The data set is most often left skewed, but can vary to be bimodal or even semi-normal. Anyways, apparently what people sometimes do is log transform the data set to make it roughly normal, and then perform an ANOVA test. Or, as in the 2nd link posted above: use the IQR to detect outliers. ... now back to debugging my R script.... ETA: for those interested: bootstrapping |
|
Quoted: When we were analyzing microarray data, we dumped the ones that were outside +/- 2 SD... I think. It's been a year or 2 since I had to do QC. Otherwise, I just plot the points and they tend to jump out and yell "me, me... I'm the weirdo!!" Enjoy the R....... |
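A rough sketch of the +/- 2 SD cutoff mentioned above, in Python. The function name `sd_outliers` and the data are hypothetical, and this is not the actual microarray QC protocol, just the bare rule. Keep in mind the mean and SD are themselves pulled around by extreme points, so this works best on roughly symmetric data.

```python
import statistics

def sd_outliers(values, n_sd=2.0):
    """Return the values lying more than n_sd sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    lower = mean - n_sd * sd
    upper = mean + n_sd * sd
    return [v for v in values if not lower <= v <= upper]

# Made-up readings with one obvious oddball.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 25]
flagged = sd_outliers(readings)
print(flagged)
```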
|
The data "dump" was done because we were dealing with microarray data... which was/is a bit of a P.I.T.A.
The method we used was a generally accepted protocol at the time. Nowadays, the procedures have changed a bit and I wouldn't dump as much data. I was told by a nationally renowned statistician when we first started doing the arrays that if we gave him a set of array data, he could tell if they were created on the same day, if we used the same samples, if different technicians did them, if it was under a full moon..... Basically, they're VERY tricky to replicate and sensitive to all sorts of stuff... But yah, I know dumping data is generally not the best thing to do. I'll dump 2 or 3 points out of 1000, but more than that gives me the heebie-jeebies nowadays. |
|
Anyone here know about hierarchical clustering methods? I think I've arrived at the right distance/dissimilarity matrix to use (Bray-Curtis), but I'm not sure what clustering method I should choose. If someone can point me in the right direction for reference material, that'd be great. Otherwise, I have a 2D matrix: the rows are individual bacterial species, and the columns are different pools of antibodies. Each cell represents how much of a particular bacterial species an antibody clone enriches for. I expect most clones to be different, but I want to visualize that by heatmap analysis. That's what I have so far, but I'm not satisfied with the clustering algorithm (currently used Ward and Average) |
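For reference, the Bray-Curtis dissimilarity between two abundance profiles u and v is sum(|u_i - v_i|) / sum(u_i + v_i). A minimal Python sketch follows, assuming non-negative enrichment values. The pool names and numbers are invented, and returning 0 for two all-zero profiles is just a convention chosen here.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance profiles."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 0.0  # both profiles empty; convention assumed for this sketch
    return numerator / denominator

def pairwise_matrix(profiles):
    """Full symmetric dissimilarity matrix for a list of profiles."""
    n = len(profiles)
    return [[bray_curtis(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical antibody pools, each a vector of enrichment per species.
pools = {
    "pool_A": [10, 0, 5],
    "pool_B": [10, 0, 5],
    "pool_C": [0, 8, 0],
    "pool_D": [5, 5, 5],
}
matrix = pairwise_matrix(list(pools.values()))
for name, row in zip(pools, matrix):
    print(name, [round(x, 3) for x in row])
```

Identical profiles score 0 and profiles with no shared species score 1, which is why it's popular for this kind of composition data.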
|
Usually when I do clustering, it's on 2D or 3D presumably metric spaces.
If you've got a metric space, you can use clustering based on the Euclidean distances among the points. If you've got an ultrametric space, you would want to use hierarchical clustering. If the similarity matrix is based on features (which it sounds like it is) then the space is likely ultrametric, but there are ways of checking this. In that case, at least in my field, we'd use traditional hierarchical clustering methods (R's hclust function), or an additive similarity tree (see Tversky & Sattath 1977) |
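One way of checking, sketched in Python: a distance matrix is ultrametric if every triple satisfies the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), and metric if it satisfies the ordinary one d(x, z) <= d(x, y) + d(y, z). The function names, tolerance, and toy matrices below are assumptions for illustration only.

```python
from itertools import permutations

def is_metric(d, tol=1e-9):
    """True if every triple satisfies the ordinary triangle inequality."""
    n = len(d)
    return all(d[i][k] <= d[i][j] + d[j][k] + tol
               for i, j, k in permutations(range(n), 3))

def is_ultrametric(d, tol=1e-9):
    """True if every triple satisfies the strong (ultrametric) inequality."""
    n = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k]) + tol
               for i, j, k in permutations(range(n), 3))

# Toy symmetric matrices, not real data.
tree_like = [[0, 2, 4],
             [2, 0, 4],
             [4, 4, 0]]
plain = [[0, 1, 2],
         [1, 0, 1.5],
         [2, 1.5, 0]]

print(is_metric(tree_like), is_ultrametric(tree_like))
print(is_metric(plain), is_ultrametric(plain))
```

The check is brute force over all triples, so it's fine for a few hundred items but will get slow beyond that.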
|
Hmm, I seem to be missing something: when I do hierarchical clustering, it seems I still have to input a distance matrix. For my data set, I use Bray-Curtis dissimilarity or the UniFrac distances. And then from there, I ask R to give me a cluster tree based on one of several clustering methods: average, point, complete or Ward method. My trouble right now is I don't know which one of those methods is most biologically relevant. (I've determined that average/point doesn't make sense, and I'm stuck between complete and Ward.) |
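To see how much the choice of linkage matters, here is a deliberately naive agglomerative clustering sketch in Python that works straight from a precomputed distance matrix. It is not how R's hclust is implemented. The function name `agglomerate` and the toy matrix are made up, and Ward is left out because its criterion is defined through within-cluster variance rather than a simple min/max/mean of pairwise distances.

```python
def agglomerate(dist, linkage="complete"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns a list of (cluster_a, cluster_b, height) merges in order.
    """
    combine = {
        "single": min,                           # nearest pair
        "complete": max,                         # farthest pair
        "average": lambda ds: sum(ds) / len(ds)  # mean over all pairs
    }[linkage]
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = combine(ds)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges

# Toy dissimilarities between four items, not real data.
toy = [[0, 1, 4, 5],
       [1, 0, 2.5, 6],
       [4, 2.5, 0, 3],
       [5, 6, 3, 0]]

for method in ("single", "complete", "average"):
    print(method, agglomerate(toy, method))
```

On this toy matrix single linkage chains item 2 onto the first pair, while complete and average linkage pair items 2 and 3 first, so the same distances give different trees.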
|
100% - you still generate a distance matrix from that space. The issue is whether or not the space is metric. If it's not metric, and it's ultrametric, then hierarchical clustering makes sense. If it's not ultrametric then you might try an additive-tree type approach. This is outside of your field of study, but see this article: http://www.cs.technion.ac.il/~moran/COURSES/papers/SaTv77.pdf |
In my graduate work in stats we had just conquered analysis of covariance and were venturing into the realm of multivariate analysis.... Sorry man...

