         height  weight  age  salary
user a:  176     70      20   5110
user b:  172     65      23   5300
With numeric attributes like these, we can simply compute a distance or the Pearson correlation between the two users.
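As a rough illustration of the numeric case, here is a minimal sketch that computes the Euclidean distance and the Pearson correlation for the two users in the table. The function names are my own, Euclidean distance is just one possible choice of "distance", and returning 0.0 for a constant vector is an assumption of this sketch.

```python
import math

# Attribute vectors from the table above: height, weight, age, salary
user_a = [176, 70, 20, 5110]
user_b = [172, 65, 23, 5300]

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    denom = math.sqrt(var_x * var_y)
    # Treat a constant vector as uncorrelated (an assumption of this sketch)
    return cov / denom if denom else 0.0

distance = euclidean_distance(user_a, user_b)
correlation = pearson_correlation(user_a, user_b)
print(distance, correlation)
```

Note that salary is on a much larger scale than the other attributes, so it dominates the distance here; rescaling the columns first is a common remedy.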
But nowadays we often run into non-numeric values on social networks, e.g.
for a Facebook app asking "what do you want":
user a: girl friend, money, car, job
user b: car, house
One method we may employ is to encode each item as a 0/1 column:
         girl friend  money  car  job  house
user a   1            1      1    1    0
user b   0            0      1    0    1
However, if user a lists a very large number of items (which is quite possible), user b's vector becomes sparse, since it needs one column for every item and almost all of them are 0. We then need more space to store the data.
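The 0/1 encoding and its space cost can be sketched as follows. This is only an illustration: the names `wishes`, `vocabulary` and `to_binary_vector` are made up for the example, and the column order is alphabetical rather than the order used in the table.

```python
# Each user's answers stored as a set of items (data from the example above)
wishes = {
    "user a": {"girl friend", "money", "car", "job"},
    "user b": {"car", "house"},
}

# The vocabulary is every distinct item mentioned by any user, in a fixed order
vocabulary = sorted(set().union(*wishes.values()))

def to_binary_vector(items, vocabulary):
    """Return a 0/1 list with one slot per vocabulary item."""
    return [1 if item in items else 0 for item in vocabulary]

vectors = {user: to_binary_vector(items, vocabulary) for user, items in wishes.items()}

for user, vector in vectors.items():
    print(user, vector)
```

Every vector is as long as the whole vocabulary, so each new item mentioned by any user adds a column to all users, while the sets in `wishes` only store what each user actually listed.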
Another method is to use the Tanimoto coefficient:
The Tanimoto coefficient uses the ratio of the intersection set to the union set as the measure of similarity. Represented as a mathematical equation:
$$ T = \frac{N_c}{N_a + N_b - N_c} $$
In this equation, $N_a$ and $N_b$ are the numbers of attributes of objects a and b, and $N_c$ is the number of attributes in their intersection, so the denominator is the size of the union.
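To make the ratio concrete, here is the arithmetic for the two users above, assuming each item is counted once. User a has four items, user b has two, and the only item they share is "car":

```latex
N_a = 4, \quad N_b = 2, \quad N_c = 1
\qquad
T = \frac{N_c}{N_a + N_b - N_c} = \frac{1}{4 + 2 - 1} = \frac{1}{5} = 0.2
```

A value of 1 would mean the two users listed exactly the same items, and 0 would mean they share nothing.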
In Python (which we often use for data mining), we can compute this as follows:
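# Inputs: two lists (each item is assumed to appear at most once per list)
# Output: the Tanimoto Coefficient
def tanimoto(list1, list2):
    # Items that appear in both lists
    intersection = [common_item for common_item in list1 if common_item in list2]
    # Size of the union = len(list1) + len(list2) - size of the intersection
    return float(len(intersection)) / (len(list1) + len(list2) - len(intersection))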
The resulting value can then be used as a similarity measure to cluster users.
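As a sketch of what the input to a clustering step might look like, the snippet below computes the coefficient for every pair of users. It uses sets instead of lists, which assumes items do not repeat; `tanimoto_sets` is a name made up for this example, "user c" is a hypothetical user added only so there is more than one pair, and returning 0.0 for two empty sets is a choice of this sketch.

```python
from itertools import combinations

def tanimoto_sets(a, b):
    """Tanimoto coefficient of two sets: |intersection| / |union|."""
    union_size = len(a | b)
    # Two empty sets are treated as having similarity 0.0 (a choice made for this sketch)
    return len(a & b) / union_size if union_size else 0.0

users = {
    "user a": {"girl friend", "money", "car", "job"},
    "user b": {"car", "house"},
    "user c": {"car", "house", "job"},  # hypothetical user, not in the original data
}

# Similarity for every unordered pair of users
similarity = {
    (u, v): tanimoto_sets(users[u], users[v])
    for u, v in combinations(sorted(users), 2)
}

# Most similar pairs first
for pair, score in sorted(similarity.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

Many clustering algorithms expect a distance rather than a similarity; one common option is to pass 1 minus the coefficient.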
