A method according to the present invention determines the similarity between sequences of symbols using rules generated by a dictionary-based compression scheme according to the content of database columns. Primers hybridizing to regions flanking these biallelic markers are also provided. The method further includes determining a value of a weighting factor based on the activity data.
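As an illustration only, the idea of comparing symbol sequences through rules derived from a dictionary-based compression scheme might be sketched as follows. The text does not name a particular scheme or similarity measure, so this sketch assumes an LZ78-style parse (each new phrase added to the dictionary is treated as a "rule") and a Jaccard overlap between the two phrase sets. The function names `lz78_phrases` and `phrase_similarity` are hypothetical, and the database columns are stood in for by plain strings.

```python
def lz78_phrases(sequence):
    """Return the set of phrases an LZ78-style parse adds to its dictionary.

    Assumption: each dictionary entry plays the role of a "rule".
    """
    phrases = set()
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in phrases:
            phrases.add(current)  # new dictionary entry
            current = ""          # restart from the empty phrase
    return phrases


def phrase_similarity(a, b):
    """Jaccard overlap of the two phrase dictionaries, a value in [0, 1].

    Assumption: the overlap measure is chosen for illustration only.
    """
    pa, pb = lz78_phrases(a), lz78_phrases(b)
    if not pa and not pb:
        return 1.0  # two empty sequences are treated as identical
    return len(pa & pb) / len(pa | pb)


if __name__ == "__main__":
    # Hypothetical column contents represented as strings of symbols.
    print(phrase_similarity("abab", "abba"))
```

Under these assumptions, two sequences that yield the same phrases score 1.0 and sequences that share no phrases score 0.0. Other dictionary coders or overlap measures could be substituted without changing the overall structure.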