

Comparing 2 lists in R

Just a simple R script that reads a CSV file made up of 2 columns and outputs which elements are only in the first column, which are only in the second column, and which are in both. Note that it doesn’t count occurrences: duplicates are removed, so each element is reported once. It’s pretty simple (except that I vectorized the membership check, so it should be reasonably fast); I’m mostly posting it as a backup.

Here’s the script:

myData=read.csv("MyListComparisonData.csv",stringsAsFactors=FALSE); # read the columns as character strings rather than factors
g1=as.character(unique(myData[,1])); # we'll name the first list (first column) g1. read.csv can turn text columns into factors (this was the default before R 4.0), so we specify "as.character" to be safe; "unique" drops repetitions, which also speeds up the loop
g2=as.character(unique(myData[,2])); # second list: g2
gcommon=c(); # initialize vector for storing common elements
gonlyG1=c(); # initialize vector for storing elements specific to g1
gonlyG2=c(); # initialize vector for storing elements specific to g2

# we'll loop over g1 and see if the current element can be found in g2
while(length(g1)>0){ # "while" rather than "repeat", so that an empty g1 is skipped instead of adding an NA to the "g1 only" list
  if(g1[1] %in% g2){ # vectorized way to check if one or more elements of g2 match the current one
    gcommon[length(gcommon)+1]=g1[1]; # if yes, we add it to the "common" list
    g2=g2[!(g2 %in% g1[1])]; # then we remove it from g2
  }
  else{gonlyG1[length(gonlyG1)+1]=g1[1];} # otherwise, we add it to the "g1 only" list
  g1=g1[-1]; # then in any case we remove it from g1
}
gonlyG2=g2; # because we removed from g2 all elements common with g1, anything left is specific to g2

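# pad with empty strings the columns which are too short
nRows=max(length(gcommon),length(gonlyG1),length(gonlyG2));
padTo=function(v,n) c(v,rep("",n-length(v))); # appends "" until v has n elements (assigning to v[n] directly would leave NA in the gaps)
gcommon=padTo(gcommon,nRows);gonlyG1=padTo(gonlyG1,nRows);gonlyG2=padTo(gonlyG2,nRows);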
# now that all vectors have same length, we can put them into a dataframe
output=data.frame(gcommon,gonlyG1,gonlyG2);
# and write to disk
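write.csv(output,"comparaison_result.csv",quote=TRUE,row.names=FALSE,na=""); # write.csv always uses "," and always writes the header, so "sep" and "col.names" can't be set (R ignores them with a warning); na="" keeps any NA cell blank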
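
For comparison, base R also has set functions (intersect and setdiff) that give the same three groups without an explicit loop. What follows is only a sketch of that alternative, not the script above: the names compareLists and toDataFrame are made up for illustration, and the input vectors are typed in by hand instead of being read from a file.

```r
# Hypothetical helper (not part of the script above):
# splits two vectors into common / first-only / second-only elements.
compareLists=function(a,b){
  a=unique(as.character(a)); # same cleanup as in the script: characters, no duplicates
  b=unique(as.character(b));
  list(common=intersect(a,b),    # elements found in both
       onlyFirst=setdiff(a,b),   # elements found only in a
       onlySecond=setdiff(b,a)); # elements found only in b
}

# Hypothetical helper: pads each group with "" so that they all have
# the same length, then puts them side by side in a data frame.
toDataFrame=function(groups){
  nRows=max(lengths(groups));
  padded=lapply(groups,function(v) c(v,rep("",nRows-length(v))));
  as.data.frame(padded,stringsAsFactors=FALSE);
}

res=compareLists(c("elementA","elementB","elementB"),c("elementB","elementC"));
print(toDataFrame(res));
```

The resulting data frame could be passed to write.csv in the same way as output in the script.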

And here’s a sample CSV input file (read.csv treats the first line as the header, so it isn’t part of the data):

"elementA","elementB"
"elementA","elementB"
"elementB","elementC"

The result for this input should be: elementA is only in the first column, elementC is only in the second column, and elementB is in both columns.
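
To double-check that expected result without touching the disk, the sample can be fed to read.csv inline through its text argument. This is just a sketch for checking by hand: the variable names sampleCsv, allElements and membership are mine, and it tabulates membership with %in% rather than running the loop from the script.

```r
# the sample file, given inline; read.csv takes the first line as the header
sampleCsv='"elementA","elementB"
"elementA","elementB"
"elementB","elementC"';
myData=read.csv(text=sampleCsv,stringsAsFactors=FALSE);
g1=unique(myData[,1]); # "elementA" "elementB"
g2=unique(myData[,2]); # "elementB" "elementC"

# one row per distinct element, with a flag for each column it appears in
allElements=union(g1,g2);
membership=data.frame(element=allElements,
                      inFirst=allElements %in% g1,
                      inSecond=allElements %in% g2,
                      stringsAsFactors=FALSE);
print(membership);
```

An element with only inFirst set to TRUE is specific to the first column, one with only inSecond set to TRUE is specific to the second, and one with both is common.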

Posted in R (R-project).

