Just a simple R script that reads a two-column CSV file and reports which elements appear only in the first column, which appear only in the second column, and which appear in both. Note that it doesn't count occurrences: duplicates are removed first. It's pretty simple (although I tried to vectorize it a bit, so it should be reasonably fast); I'm mainly posting it as a backup.
Here’s the script:
myData=read.csv("MyListComparisonData.csv");
# we'll name the first list (first column) g1. It can come back as a factor
# (read.csv converts strings to factors by default in R versions before 4.0),
# so we specify "as.character"; "unique" drops repetitions, which also speeds up the loop
g1=as.character(unique(myData[,1]));
g2=as.character(unique(myData[,2])); # second list: g2
gcommon=c(); # initialize vector for storing common elements
gonlyG1=c(); # initialize vector for storing elements specific to g1
gonlyG2=c(); # initialize vector for storing elements specific to g2
# we'll loop over g1 and see if the current element can be found in g2
while(length(g1)>0){ # stop once g1 is empty (this also covers an empty first column)
  if(length(which(g2==g1[1]))>0){ # vectorized way to check if one or more elements of g2 match the current one
    gcommon[length(gcommon)+1]=g1[1]; # if yes, we add it to the "common" list
    g2=g2[-which(g2==g1[1])]; # then we remove it from g2
  } else {
    gonlyG1[length(gonlyG1)+1]=g1[1]; # otherwise, we add it to the "g1 only" list
  }
  g1=g1[-1]; # then in any case we remove it from g1
}
gonlyG2=g2; # because we removed from g2 all common elements with g1, anything left is specific to g2
# pad the columns which are too short: assigning "" at index nRows extends each vector (the gap is filled with NA)
nRows=max(length(gcommon),length(gonlyG1),length(gonlyG2))+1;
gcommon[nRows]="";gonlyG1[nRows]="";gonlyG2[nRows]="";
# now that all vectors have the same length, we can put them into a data frame
output=data.frame(gcommon,gonlyG1,gonlyG2);
# and write to disk; na="" writes the NA padding as empty cells
# (write.csv ignores "sep" and "col.names" with a warning, so they are left out)
write.csv(output,"comparaison_result.csv",quote=TRUE,row.names=FALSE,na="");
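For comparison, the same three groups can also be obtained with base R's set functions instead of an explicit loop. This is only a minimal sketch, not the script above: the helper names `compare_lists` and `pad_to_table` and the output column names are made up for illustration.

```r
# Hypothetical helper (not part of the original script): compares two
# vectors with base R set operations after dropping duplicates.
compare_lists <- function(a, b) {
  a <- unique(as.character(a))
  b <- unique(as.character(b))
  list(
    common      = intersect(a, b), # elements found in both vectors
    only_first  = setdiff(a, b),   # elements found only in a
    only_second = setdiff(b, a)    # elements found only in b
  )
}

# Hypothetical helper: pads each group with "" so all of them have the
# same length and can sit side by side in a data frame.
pad_to_table <- function(groups) {
  n <- max(lengths(groups))
  padded <- lapply(groups, function(v) c(v, rep("", n - length(v))))
  as.data.frame(padded, stringsAsFactors = FALSE)
}

res <- compare_lists(c("elementA", "elementB"), c("elementB", "elementC"))
out <- pad_to_table(res)
print(out)
```

The resulting data frame could then be passed to `write.csv(out, "comparaison_result.csv", row.names=FALSE)` just like in the script.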
And here’s a sample CSV input file:
"elementA","elementB"
"elementA","elementB"
"elementB","elementC"
The result for this input should be: elementA is only in the first column, elementC is only in the second column, and elementB is in both columns.
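One thing worth knowing about this sample: `read.csv` treats the first line as a header by default, so only two of the three lines end up as data. The sketch below (the variable names are made up, and the sample is read from an in-memory string instead of a file) suggests the distinct values per column are the same either way, which would explain why the result holds regardless. Whether the repeated first line was meant as a stand-in header is my assumption.

```r
# The sample input as in-memory text, one vector element per line
# (hypothetical variable name; the script itself reads a file).
sample_lines <- c(
  '"elementA","elementB"',
  '"elementA","elementB"',
  '"elementB","elementC"'
)

# Default behaviour: the first line is consumed as the header.
with_header <- read.csv(text = sample_lines, stringsAsFactors = FALSE)
# Same text, but every line is treated as data.
no_header <- read.csv(text = sample_lines, header = FALSE,
                      stringsAsFactors = FALSE)

print(nrow(with_header)) # number of data rows when line 1 is a header
print(nrow(no_header))   # number of data rows when it is not

# The distinct values in each column, under both readings.
print(unique(with_header[, 1]))
print(unique(no_header[, 1]))
print(unique(with_header[, 2]))
print(unique(no_header[, 2]))
```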