Saturday, March 9, 2013

clustering when user's data is nonnumeric

In data mining, we often need to cluster data using k-means, hierarchical clustering, Pearson correlation, and so on. If the user's data is numeric, this is easy, e.g.

          height  weight  age  salary
user a:   176     70      20   5110
user b:   172     65      23   5300

We just compute the distance or the Pearson correlation based on these numeric values.
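As a minimal sketch of the numeric case, the two measures could be computed as below. The function names `euclidean` and `pearson` are illustrative choices, not from any particular library, and the feature order (height, weight, age, salary) follows the table above.

```python
import math

# Feature order: height, weight, age, salary (taken from the table above).
user_a = [176, 70, 20, 5110]
user_b = [172, 65, 23, 5300]

def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors.

    Returns 0.0 when either vector is constant (an assumption made here
    to avoid dividing by zero)."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

print(euclidean(user_a, user_b))
print(pearson(user_a, user_b))
```

Note that salary has a much larger range than the other columns, so it dominates the Euclidean distance here; in practice the columns would probably be scaled first.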

But nowadays, on social networks, we often run into nonnumeric values, e.g.

for a Facebook app, "what do you want":

user a: girl friend,  money,  car, job
user b: car, house

One method we may employ is to turn every item into a 0/1 column:

          girl friend  money  car  job  house
user a    1            1      1    1    0
user b    0            0      1    0    1
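The 0/1 encoding can be sketched in Python as follows. The names `wishes`, `items` and `to_binary_vector` are hypothetical, chosen only for this illustration; the rows are derived directly from the two wish lists given earlier.

```python
# Wish lists as given in the example.
wishes = {
    "user a": ["girl friend", "money", "car", "job"],
    "user b": ["car", "house"],
}

# Fixed column order for the 0/1 vectors.
items = ["girl friend", "money", "car", "job", "house"]

def to_binary_vector(user_items, all_items):
    """Return 1 for every item the user listed and 0 otherwise."""
    wanted = set(user_items)
    return [1 if item in wanted else 0 for item in all_items]

vectors = {}
for user, user_items in wishes.items():
    vectors[user] = to_binary_vector(user_items, items)

print(vectors["user a"])
print(vectors["user b"])
```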

However, if user a lists a very large number of items (which is quite possible), user b's row becomes mostly zeros, i.e. sparse, and we need a lot more space to store the data.

Another method is to use the Tanimoto coefficient:

The Tanimoto coefficient uses the ratio of the intersection set to the union set as the measure of similarity. Represented as a mathematical equation:

T(a, b) = N_c / (N_a + N_b - N_c)

In this equation, N_a and N_b are the numbers of attributes in objects a and b, and N_c is the number of attributes in their intersection set c.
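As a worked example with the two users above (the ratio is written out again here so the arithmetic is easy to follow): user a has 4 items, user b has 2 items, and the only item they share is "car".

```latex
T(a, b) = \frac{N_c}{N_a + N_b - N_c} = \frac{1}{4 + 2 - 1} = \frac{1}{5} = 0.2
```

A value of 1 would mean the two lists are identical, and 0 would mean they share nothing.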

If we use Python (in fact, we usually do use Python in data mining), we can compute this as follows:

  
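# Inputs: two lists (each item is assumed to appear at most once per list)
# Output: the Tanimoto Coefficient
def tanimoto(list1, list2):
  # items that appear in both lists
  intersection = [common_item for common_item in list1 if common_item in list2]
  # size of the union = len(list1) + len(list2) - len(intersection)
  return float(len(intersection)) / (len(list1) + len(list2) - len(intersection))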

The resulting value can then be used to cluster users.
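One simple way to do that, shown here only as a rough sketch and not as the one right method, is to put two users in the same group whenever their Tanimoto coefficient reaches some threshold. The users "user c" and "user d", the threshold 0.5, and the function name `group_by_similarity` are all made up for this illustration; the coefficient is redefined with sets so the block runs on its own.

```python
def tanimoto(list1, list2):
    """Tanimoto coefficient of two item lists (0.0 if both are empty)."""
    set1, set2 = set(list1), set(list2)
    common = len(set1 & set2)
    union = len(set1) + len(set2) - common
    return float(common) / union if union else 0.0

def group_by_similarity(users, threshold):
    """Group users so that any two users whose similarity is >= threshold
    end up in the same group, directly or through a chain of other users."""
    clusters = []
    for name in users:
        # Existing groups that contain at least one sufficiently similar user.
        matches = [c for c in clusters
                   if any(tanimoto(users[name], users[m]) >= threshold for m in c)]
        merged = [name]
        for c in matches:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

users = {
    "user a": ["girl friend", "money", "car", "job"],
    "user b": ["car", "house"],
    "user c": ["house", "car", "travel"],  # hypothetical extra user
    "user d": ["money", "job"],            # hypothetical extra user
}

for group in group_by_similarity(users, 0.5):
    print(sorted(group))
```

Storing each user's items as a list or set, as done here, also avoids the sparse 0/1 table from the first method. For real data, a library implementation of hierarchical clustering on top of the distance 1 - T would likely be a better choice than this hand-rolled grouping.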




 
