ALGORITHMS FOR INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PARTIAL SOCIAL GRAPH

THESIS PRESENTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

RAÏSSA YAPAN DOUGNON

DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCES, UNIVERSITÉ DE MONCTON

September 2015

COMPOSITION OF THE JURY

Jury chair: Julien Chiasson, Professor of Computer Science, Université de Moncton
External examiner: Usef Faghihi, Professor of Computer Science, Cameron University, United States
Internal examiner: Mustapha Kardouchi, Professor of Computer Science, Université de Moncton
Thesis supervisor: Philippe Fournier-Viger, Professor of Computer Science, Université de Moncton

ACKNOWLEDGEMENTS

My most sincere thanks go to Philippe Fournier-Viger, my thesis supervisor. Throughout this work, Philippe has always been available and enthusiastic, and he helped me a great deal. He guided me, supported me and, above all, encouraged me in my research. Thank you, Philippe, for your generosity in research training and supervision. To my parents, Wifried and Angèle DOUGNON, for their love and their unwavering commitment to my success on the path of life. To my sisters and my brother, whose expectations I hope to live up to: thank you for the moral support and encouragement you have given me. A deep thank you to Harouna M Coulibaly, for his constant, unconditional love and for his support and encouragement throughout my studies. To my uncle and friend Hamadoun Baba Touré, who supported me at the right moments, my warm gratitude and thanks. Finally, I also wish to thank the members of the jury, who agreed to evaluate my work: Usef Faghihi, Mustapha Kardouchi and Philippe Fournier-Viger.


TABLE OF CONTENTS (French)

ABSTRACT (French) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
GENERAL INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER I PGPI: AN ALGORITHM FOR INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . 5
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 The proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Inferring user profiles using nodes and friendship links . . . . . . . . . . . . . . 12
4.2 Inferring user profiles using groups and publications . . . . . . . . . . . . . . . 13
4.3 Inferring user profiles using nodes, links, groups and publications . . . . . . . . 15
5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
CHAPTER II PGPI+: AN IMPROVED ALGORITHM FOR INFERRING USER PROFILES IN SOCIAL NETWORKS FROM A PARTIAL SOCIAL GRAPH . . . . . . . . . . 25
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 The proposed PGPI+ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Optimizing coverage and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Extension of PGPI+ to handle numeric attributes . . . . . . . . . . . . . . . . . 34
4.4 Extension of PGPI+ to assess prediction certainty . . . . . . . . . . . . . . . . . 35
5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
GENERAL CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

TABLE OF CONTENTS

ABSTRACT (French) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
GENERAL INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER I PGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 The Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Inferring user profiles using nodes and links . . . . . . . . . . . . . . . . . . . . 12
4.2 Inferring user profiles using groups and publications . . . . . . . . . . . . . . . 13
4.3 Inferring user profiles using nodes, links, groups and publications . . . . . . . . 15
5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
CHAPTER II PGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE CONSTRAINT OF A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . 25
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 The proposed PGPI+ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Optimizations to improve accuracy and coverage . . . . . . . . . . . . . . . . . 33
4.3 Extension to handle numerical attributes . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Extension to evaluate the certainty of predictions . . . . . . . . . . . . . . . . . 35
5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
GENERAL CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

LIST OF TABLES

I-1 Best accuracy/coverage results for each attribute . . . . . . . . . . . . . . . . . . 24
II-1 Best accuracy results for nominal attributes on Facebook . . . . . . . . . . . . . 41
II-2 Best accuracy results for nominal attributes on Pokec . . . . . . . . . . . . . . . 42
II-3 Average error and standard deviation for numeric attributes . . . . . . . . . . . . 43
II-4 Best accuracy for nominal attributes using the full social graph . . . . . . . . . . 44


LIST OF FIGURES

I-1 Product of accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . 17
I-2 Accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . . . . . . . 18
I-3 Number of accessed facts w.r.t. maxFacts . . . . . . . . . . . . . . . . . . . . . . 19
I-4 Product of accuracy and coverage w.r.t. ratioFacts . . . . . . . . . . . . . . . . . 20
I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t. maxDistance for PGPI-N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
II-1 Accuracy w.r.t. number of accessed facts for nominal attributes . . . . . . . . . . 39
II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


LIST OF ALGORITHMS

1 The PGPI-N algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 The PGPI-G algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Certainty assessment of a nominal attribute value . . . . . . . . . . . . . . . . . . 37


ABSTRACT (French)

Online social networks occupy an important place in today's society. An important problem for ad targeting on social networks is that users often disclose little personal information there. To address this problem, a number of algorithms have been proposed to automatically infer user profiles from publicly available information. However, these algorithms all exhibit at least two of the following four limitations: (1) they assume that the full social graph is available, (2) they do not exploit the rich content available on some social networks, such as group memberships and "likes", (3) they treat numeric attributes as nominal attributes, and (4) they do not assess the certainty of their predictions. To address (1) and (2), we present an algorithm named Partial Graph Profile Inference (PGPI). This algorithm infers user profiles in the absence of the full social graph. PGPI requires no training, and offers fine-grained control over the trade-off between the amount of information accessed to make an inference and the accuracy of that inference. Moreover, PGPI is designed to use varied information about users (friendship links, user profiles, interest-group memberships, and the "views" and "likes") found on social networks such as Facebook. To address limitations (3) and (4), we present an improved version of PGPI named PGPI+ (Partial Graph Profile Inference+). This version handles numeric attributes in addition to nominal attributes, and also assesses the certainty of its predictions.
An experimental evaluation with more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI/PGPI+ infer user profiles with higher accuracy than several approaches from the literature, while accessing less information from the social graph. Moreover, an interesting result is that some attributes, such as status (student/professor) and gender, can be predicted with more than 95% accuracy. Keywords: social networks, inference, user profiles, partial graph


ABSTRACT

Online social networks have an important place in today's society. An important problem for ad targeting on social networks is that users often do not reveal much personal information. To address this issue, various algorithms have been proposed to automatically infer user profiles using publicly disclosed information. Nevertheless, all these algorithms suffer from at least two of the following limitations: (1) they assume that the full social graph is available for training, (2) they do not exploit the rich information available on some social networks, such as group memberships, "views" and "likes", (3) they treat numeric attributes as nominal attributes, and (4) they do not assess the certainty of predictions. To address limitations 1 and 2, we propose an algorithm named Partial Graph Profile Inference (PGPI). The PGPI algorithm can accurately infer user profiles under the constraint of a partial social graph. PGPI does not require training, and it lets the user select the trade-off between the amount of information used for inference and the accuracy of predictions. Besides, PGPI is designed to use rich information about users when available: user profiles, friendship links, group memberships, and the "views" and "likes" from social networks such as Facebook. Moreover, to also address limitations 3 and 4, we present an improved version of PGPI named PGPI+ (Partial Graph Profile Inference+). PGPI+ considers numeric attributes in addition to nominal attributes, and can evaluate the certainty of its predictions. An experimental evaluation with more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI/PGPI+ predict user profiles with an accuracy higher than several state-of-the-art algorithms, while accessing less information from the social graph. Furthermore, an interesting result is that some profile attributes, such as status (student/professor) and gender, can be predicted with more than 95% accuracy using PGPI+.
Keywords: social networks, inference, user profiles, partial graph


GENERAL INTRODUCTION

In today's society, online social networks are extremely popular. A recent study indicates that there are 3.025 billion Internet users worldwide, that 2.060 billion of them are active on social networks (68% of Internet users and 28% of the world population), and that a social network user spends on average two hours per day on these sites [15]. There are various types of social networks, such as friendship networks (e.g., Facebook and Google Plus), professional networks (e.g., ResearchGate and LinkedIn) and networks dedicated to specific interests (e.g., Flickr). An important problem for ad targeting on social networks is that users generally share little information publicly [1, 12]. To remedy this problem, several researchers in the field of social network mining are now interested in developing algorithms to automatically infer detailed user profiles using the information publicly disclosed by users [3, 6, 8, 11, 13, 14]. Inferring a user's profile consists of discovering the values of the attributes of that profile, such as political opinion, place of residence, age, personality traits, sexual orientation, ethnic origin, etc. To solve this problem, researchers have proposed several user profile inference algorithms, including relational naïve Bayes classifiers [14], label propagation [11, 13], majority voting [6], linear regression [12], Latent Dirichlet Allocation [3] and community detection [16]. These algorithms can predict private attributes of user profiles. However, they all suffer from at least one of the following four important limitations.
The first limitation is that the vast majority of these algorithms assume that the full social graph, or a large part of it, is available for training [11]. This is a convenient assumption in research, when one wishes to compare the performance of algorithms using a large training dataset. In practice, however, the full social graph is generally unavailable, or can be very costly to obtain or keep up to date [3, 8]. As a result, several of these algorithms may be inapplicable or may produce predictions with low accuracy. A few approaches, such as majority voting [6], do not assume a full social graph. However, they are limited to exploring the immediate neighbors of a node and do not let the user choose the trade-off between the number of nodes accessed in the social graph and the accuracy of predictions, which can lead to predictions with low accuracy. Regarding the second limitation, we note that most of these algorithms do not use the rich information often available on social networks. On the one hand, several of these algorithms consider the links between users and the various attributes of user profiles, but do not consider other information about users, such as group memberships and the "likes" and "views" available on social networks such as Facebook [1, 6, 9, 11, 13, 14, 16]. On the other hand, several of these algorithms consider the similarity between user profiles and information such as "likes", but not the friendship links between users or other types of information available on social networks [3, 12, 20]. The third limitation is that several of these algorithms offer no inference procedure specific to numeric attributes. They therefore treat numeric attributes (e.g., age) as nominal attributes [1, 6], which can cause a loss of accuracy when inferring a user's profile.

Some of these algorithms are specifically designed to handle numeric attributes, but still assume training with the full social graph or a large part of it, which is impractical [7, 11, 12, 16]. The fourth important limitation is the lack of assessment of prediction certainty. Few approaches assess the certainty of their predictions, even though this information is essential to determine whether a prediction is reliable. For example, in the context of ad targeting, if the certainty that a prediction is correct is low, it is preferable not to use that prediction for ad targeting [4], since an advertisement aimed at a different category of users could otherwise be shown.
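To make the fourth limitation concrete, a certainty score attached to each prediction allows low-confidence predictions to be discarded before ad targeting. The sketch below uses the fraction of observations supporting the predicted value as a certainty measure; this is an illustrative measure only, not the actual certainty procedure of PGPI+:

```python
from collections import Counter

def predict_with_certainty(observed_values):
    """Return (predicted value, certainty), where certainty is the
    fraction of observations supporting the majority value
    (illustrative measure, not the PGPI+ procedure)."""
    if not observed_values:
        return None, 0.0
    value, count = Counter(observed_values).most_common(1)[0]
    return value, count / len(observed_values)

value, certainty = predict_with_certainty(
    ["student", "student", "student", "professor"])
print(value, certainty)  # student 0.75

# Only use the prediction for ad targeting if it is reliable enough:
MIN_CERTAINTY = 0.7
use_for_targeting = certainty >= MIN_CERTAINTY  # True here
```

With such a score, an advertiser can trade coverage for reliability by raising the threshold.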


Objective. The main objective of this work is to propose a new algorithm for inferring user profiles in social networks that overcomes the aforementioned limitations, that is: (1) it lets the user choose the trade-off between the amount of information used for a prediction and its accuracy, (2) it takes into account the rich information found on social networks, such as group memberships and the "likes" and "views" of Facebook, (3) it handles numeric attributes, and (4) it assesses the certainty of predictions. Moreover, the algorithm should offer high accuracy compared with algorithms from the literature, while accessing a limited number of nodes of the social graph. Hypotheses. This work relies on the following hypotheses: — An algorithm that predicts the attribute values of a user profile in a social network using a partial social graph can offer an accuracy comparable or superior to that of several approaches from the literature that use the full social graph. — Using a lazy algorithm (without training) makes it possible to let the user decide the amount of information used for inference, and thus the accuracy of predictions. — Using rich information about users (group memberships, "likes" and "views") will yield predictions with higher accuracy, when this information is available. — Treating users' numeric attributes (e.g., age, weight and height) differently from nominal attributes will also yield more accurate predictions. — An algorithm for inferring user profiles from a partial social graph can be designed so that it assesses the certainty of its predictions. Organization of the thesis. This is a thesis by publications.
Chapters 2 and 3 respectively contain two articles published in the proceedings of international artificial intelligence conferences. — The first article, « Inferring User Profiles in Online Social Networks using a Partial Social Graph », was published in the proceedings of AI 2015: The 28th Canadian Conference on Artificial Intelligence. It proposes an algorithm named PGPI that addresses the first two limitations mentioned above. PGPI infers user profiles by exploiting a partial social graph and rich information about users, such as their group memberships, "likes" and "views". Experimental results show that PGPI offers an accuracy superior to several algorithms from the literature. Moreover, PGPI predicts some attributes, such as status and gender, with an accuracy above 90%. — The second article, « PGPI+: Accurate Social Network User Profiling under the Constraint of a Partial Social Graph », was accepted for publication in the proceedings of MICAI 2015: The 14th Mexican International Conference on Artificial Intelligence. It presents an improved version of PGPI named PGPI+. PGPI+ introduces new optimizations to increase prediction accuracy. In addition, it handles numeric attributes specifically and assesses the certainty of predictions. Note that the second publication was slightly modified to add details that had been omitted for lack of space in the published version. Finally, Chapter 4 of this thesis presents the general conclusion and perspectives for future work.


CHAPTER I PGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PARTIAL SOCIAL GRAPH

Abstract Most algorithms for user profile inference in online social networks assume that the full social graph is available for training. This assumption is convenient in a research setting. However, in real life, the full social graph is generally unavailable or may be very costly to obtain or update. Thus, several of these algorithms may be inapplicable or provide poor accuracy. Moreover, current approaches often do not exploit all the rich information that is available in social networks. In this paper, we address these challenges by proposing an algorithm named PGPI (Partial Graph Profile Inference) to accurately infer user profiles under the constraint of a partial social graph and without training. It is, to our knowledge, the first algorithm that lets the user control the trade-off between the amount of information accessed from the social graph and the accuracy of predictions. Moreover, it is also designed to use rich information about users such as group memberships, views and likes. An experimental evaluation with 11,247 Facebook user profiles shows that PGPI predicts user profiles more accurately, and by accessing a smaller part of the social graph, than four state-of-the-art algorithms. Moreover, an interesting result is that profile attributes such as status (student/professor) and gender can be predicted with more than 90% accuracy using PGPI. Keywords: social networks, inference, user profiles, partial graph

1

Introduction

In today's society, online social networks have become extremely popular. Various types of social networks are used, such as friendship networks (e.g., Facebook and Twitter), professional networks (e.g., LinkedIn and ResearchGate) and networks dedicated to specific interests (e.g., Flickr and IMDB). A concern that is often raised about social networks is how to protect users' personal information [10]. To address this concern, much work has been done to help users select how their personal information is shared and with whom, and to propose automatic anonymization techniques [10]. The result is that personal information is better protected, but also that less information is now available to companies to provide targeted advertisements and services to their users. To address this issue, an important sub-field of social network mining is now interested in developing automatic techniques to infer user profiles using the disclosed public information.

Various methods have been proposed to solve this problem, such as relational Naïve Bayes classifiers [14], label propagation [11, 13], majority voting [6] and approaches based on linear regression [12], Latent Dirichlet Allocation [3] and community detection [16]. It was shown that these methods can accurately predict hidden attributes of user profiles in many cases. However, these works suffer from two important limitations. First, the great majority of these approaches make the unrealistic assumption that the full social graph is available for training (e.g., label propagation). This is a convenient assumption used by researchers for testing algorithms using large public datasets. However, in real life, the full social graph is often unavailable (e.g., on Facebook, LinkedIn and Google Plus) or may be very costly to obtain [3]. Moreover, even if the full social graph is available, it may be impractical to keep it up to date. The result is that several of these methods are inapplicable or provide poor accuracy if the full social graph is not available (see the experiment section of this paper). A few approaches, such as majority voting [6], do not assume a full social graph. However, the latter is limited to exploring only the immediate neighbors of a node, and thus does not let the user control the trade-off between the number of nodes accessed and prediction accuracy, which may lead to low accuracy. It is thus an important challenge to develop algorithms that let the user choose the best trade-off between the number of nodes accessed and prediction accuracy. Second, another important limitation is that several algorithms do not consider the rich information that is generally available on social networks. On one hand, several algorithms consider links between users and user attributes, but do not consider other information such as group memberships, "likes" and "views" that are available on social networks such as Facebook and Twitter [1, 6, 9, 11, 13, 14, 16].
On the other hand, several algorithms consider the similarity between user profiles and information such as "likes", but not the links between users and other rich information [3, 12, 20]. Yet exploiting more information may help to further increase accuracy. In this paper, we address these challenges. Our contributions are as follows. First, we propose a new lazy algorithm named PGPI (Partial Graph Profile Inference) that can accurately infer user profiles under the constraint of a partial graph and without training. The algorithm is, to our knowledge, the first that lets the user select how many nodes of the social graph can be accessed to infer a user profile. This lets the user choose the best trade-off between accuracy and the number of nodes visited. Second, the algorithm is also designed to be able to use not only information about friendship links and profiles, but also group memberships, likes and views, when this information is available. Third, we report results from an extensive empirical evaluation against four state-of-the-art algorithms when applied to 11,247 Facebook user profiles. Results show that the proposed algorithm can provide a considerably higher accuracy while accessing a much smaller number of nodes from the social graph. Moreover, an interesting result is that profile attributes such as status (student/professor) and gender can be predicted with more than 90% accuracy using PGPI. The rest of this paper is organized as follows. Sections 2, 3, 4, 5 and 6 respectively present the related work, the problem definition, the proposed algorithm, the experimental evaluation and the conclusion.

2

Related Work

Much work has been done to infer user profiles in online social networks. We briefly review recent work on this topic. Davis Jr et al. [6] inferred the location of Twitter users based on relationships to other users with known locations. They used a dataset of about 50,000 users. Their approach consists of performing a majority vote over the locations of directly connected users. An important limitation is that this approach is designed to predict a single attribute based on a single attribute from other user profiles. With the same goal, Jurgens [11] designed a variation of the popular label propagation approach to predict user locations on the Twitter/Foursquare social networks. A major limitation is that it is an iterative algorithm that requires the full social graph, since it propagates known labels to unlabeled nodes through links between nodes. Li et al. [13] developed an iterative algorithm to deduce LinkedIn user profiles based on relation types. The algorithm is similar to label propagation and requires a large training set to discover relation types. He et al. [9] proposed an approach consisting of building a Bayesian network based on the full social graph to then predict user attribute values. The approach considers similarity between user profiles and links between users to perform predictions, and was applied to data collected from LiveJournal. Recently, Dong et al. [7] proposed a similar approach based on graphical models to predict the age and gender of users. Their study was performed with 1 billion phone and SMS records and 7 million user profiles. Chaudhari [4] also proposed a graphical-model-based approach to infer user profiles, designed to perform predictions only when the probability of making an accurate prediction is high. The approach has shown high accuracy on a dataset of more than 1M Twitter users and a dataset from the Pokec social network. A limitation of these approaches, however, is that they all assume that the full social graph is available for training. Furthermore, they only consider user attributes and links, but do not consider additional information such as likes, views and group memberships. A community-detection-based approach for user profile inference was proposed by Mislove [16]. It was applied to more than 60K Facebook profiles with friendship links. It consists of applying a community detection algorithm and then inferring user profiles based on similarity to other members of the same community. Lindamood et al. [14] also inferred user profiles from Facebook data. They applied a modified Naïve Bayes classifier to 167K Facebook profiles with friendship links, and concluded that if links or attributes are erased, the accuracy of the approach can greatly decrease. This raises the challenge of performing accurate predictions using little data. Recently, Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to infer user profiles in a dataset of 3 million Hyves.nl users. However, all these works assume that the full social graph is available for training, and they only use profile information and social links to perform predictions. Chaabane et al. [3] proposed an approach based on Latent Dirichlet Allocation (LDA) to infer Facebook user profiles. The approach extracts a probabilistic model from music interests and additional information provided by Wikipedia.
A drawback is that it requires a large training dataset, which was very difficult and time-consuming to obtain according to the authors [3]. Kosinski et al. [12] also utilized information about user preferences to infer Facebook user profiles. The approach consists of applying Singular Value Decomposition to a large users/likes matrix and then applying regression to perform predictions. A limitation of this work is that it does not utilize information about links between users and requires a very large training dataset. Quercia et al. [20] developed an approach to predict the personality of Twitter users. The approach consists of training a decision tree using a large training set of users tagged with their personality traits, and then using it to predict the personality traits of other users. Although this approach was successful, it uses a very limited set of information to perform predictions: the number of followers, the number of persons followed and the list membership count.

3 Problem Definition

As outlined above, most approaches assume a large training set or full social graph. The most common definition of the problem of inferring user profiles is the following [1, 4, 9, 11, 13, 14, 16].

Definition 1. Social graph. A social graph G is a quadruplet G = {N, L, V, A}. N is the set of nodes in the social graph. L ⊆ N × N is a binary relation representing the links (edges) between nodes. Let there be m attributes describing the users of the social network, such that V = {V1, V2, ..., Vm} contains, for each attribute i, the set of possible attribute values Vi. Finally, A = {A1, A2, ..., Am} contains, for each attribute i, a relation assigning an attribute value to nodes, that is, Ai ⊆ N × Vi.

Example. Consider a social graph with three nodes N = {John, Alice, Mary} and friendship links L = {(John, Mary), (Mary, John), (Mary, Alice), (Alice, Mary)}. Consider two attributes gender and status, respectively called attributes 1 and 2, describing users. The sets of possible attribute values for gender and status are respectively V1 = {male, female} and V2 = {professor, student}. The relations assigning attribute values to nodes are A1 = {(John, male), (Alice, female), (Mary, female)} and A2 = {(John, student), (Alice, student), (Mary, professor)}.

Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring the user profile of a node n ∈ N in a social graph G is to correctly guess the attribute values of n using the other information provided in the social graph.

We also consider an extended problem definition that takes into account additional information available in social networks such as Facebook (views, likes and group memberships).

Definition 3. Extended social graph. An extended social graph E is a tuple E = {N, L, V, A, G, NG, P, PG, LP, VP}, where N, L, V, A are defined as previously. G is a set

of groups that a user can be a member of. The relation NG ⊆ N × G indicates the membership of users to groups. P is a set of publications such as pictures, texts and videos that are posted in groups. PG is a relation PG ⊆ P × G, which associates a publication to the group(s) where it was posted. LP is a relation LP ⊆ N × P, which indicates the publication(s) liked by each user (e.g. the "likes" on Facebook). VP is a relation VP ⊆ N × P, which indicates the publication(s) viewed by each user (e.g. the "views" on Facebook). An observation is that LP ⊆ VP.

Example. Consider two groups G = {book_club, music_lovers} such that NG = {(John, book_club), (Mary, book_club), (Alice, music_lovers)}. Consider two publications P = {picture1, picture2} published in the groups, with PG = {(picture1, book_club), (picture2, music_lovers)}. The publications viewed by users are VP = {(John, picture1), (Mary, picture1), (Alice, picture2)}, while the publications liked by users are LP = {(John, picture1), (Alice, picture2)}.

Definition 4. Problem of inferring user profiles in an extended social graph. The problem of inferring the user profile of a node n ∈ N in an extended social graph E is to correctly guess the attribute values of n using the information in the social graph.

However, the above definitions assume that the full social graph may be used to perform predictions. In this paper, we define the problem of inferring user profiles using a limited amount of information as follows.

Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or publication from N, G or P (excluding n).
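To make these definitions concrete, the running example can be encoded directly with Python sets and dictionaries. This is an illustrative sketch only; the field names are ours, not from an actual implementation.

```python
# A minimal encoding of the extended social graph of Definitions 1 and 3,
# populated with the running example (John, Alice, Mary).
social_graph = {
    "N": {"John", "Alice", "Mary"},
    "L": {("John", "Mary"), ("Mary", "John"), ("Mary", "Alice"), ("Alice", "Mary")},
    "V": {"gender": {"male", "female"}, "status": {"professor", "student"}},
    "A": {
        "gender": {"John": "male", "Alice": "female", "Mary": "female"},
        "status": {"John": "student", "Alice": "student", "Mary": "professor"},
    },
    # Extended part (Definition 3)
    "G": {"book_club", "music_lovers"},
    "NG": {("John", "book_club"), ("Mary", "book_club"), ("Alice", "music_lovers")},
    "P": {"picture1", "picture2"},
    "PG": {("picture1", "book_club"), ("picture2", "music_lovers")},
    "VP": {("John", "picture1"), ("Mary", "picture1"), ("Alice", "picture2")},
    "LP": {("John", "picture1"), ("Alice", "picture2")},
}

# Sanity check of the observation in Definition 3: likes are a subset of views.
assert social_graph["LP"] <= social_graph["VP"]
```

Representing each relation as a set of pairs keeps the encoding close to the mathematical definitions; a real implementation would likely index these relations per node for efficiency.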

4 The Proposed Algorithm

In this section, we present the proposed PGPI algorithm. We first describe a version, PGPI-N, that infers user profiles using only nodes and links. Then, we present an alternative version, PGPI-G, designed for predicting user profiles using only group and publication information (views and likes). Finally, we explain how these two versions are combined in the full PGPI algorithm.

4.1 Inferring user profiles using nodes and links

Our proposed algorithm PGPI-N for inferring profiles using nodes and links is inspired by the label propagation family of algorithms, which was shown to provide high accuracy [11, 13]. These algorithms however suffer from an important limitation: they are iterative algorithms that require the full social graph for training to propagate attribute values [11]. PGPI-N adapts the idea of label propagation to the case where at most maxFacts facts from the social graph can be accessed to make a prediction. This is possible because, unlike label propagation, PGPI-N is a lazy algorithm (it does not require training). To predict an attribute value of a node n, PGPI-N only explores the neighborhood of n, which is restricted by a parameter maxDistance.

The PGPI-N algorithm (Algorithm 1) takes as parameters a node ni, an attribute k to be predicted, the maxFacts and maxDistance parameters, and a social graph G. It outputs a predicted value v for attribute k of node ni. The algorithm first initializes a map M so that it contains a key-value pair (v, 0) for each possible value v of attribute k. The algorithm then performs a breadth-first search. It first initializes a queue Q to store nodes and a set seen to remember the already visited nodes, and the node ni is pushed into the queue. Then, while the queue is not empty and the number of accessed facts is less than maxFacts, the algorithm pops the first node nj from the queue, and the formula Fi,j = Wi,j / dist(ni, nj) is calculated. Wi,j and dist(ni, nj) are respectively used to weight the influence of node nj by its similarity and distance to ni. Wi,j is defined as the number of attribute values common to ni and nj divided by the number of attributes m. dist(ni, nj) is the number of edges in the shortest path between ni and nj. Then, Fi,j is added to the entry in map M for the attribute value of nj for attribute k. Then, if dist(ni, nj) ≤ maxDistance, each node nh linked to nj that was not already visited is pushed into the queue and added to the set seen. Finally, when the while loop terminates, the attribute value v associated with the largest value in M is returned as the predicted value for ni. Note that in our implementation, if the maxFacts limit is reached, the algorithm does not perform a prediction, to reduce the probability of making an inaccurate prediction.

Algorithm 1: The PGPI-N algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts and maxDistance: the user-defined thresholds, G: a social graph
output: the predicted attribute value v
1  M = {(v, 0) | v ∈ Vk};
2  Initialize a queue Q;
3  Q.push(ni);
4  seen = {ni};
5  while Q is not empty and |accessedFacts| < maxFacts do
6      nj = Q.pop();
7      Fi,j ← Wi,j / dist(ni, nj);
8      Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;
9      if dist(ni, nj) ≤ maxDistance then
10         foreach node nh ≠ ni such that (nh, nj) ∈ L and nh ∉ seen do
11             Q.push(nh);
12             seen ← seen ∪ {nh};
13         end
14     end
15 end
16 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
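As an illustration, Algorithm 1 can be sketched in Python as follows. This is a simplified, hypothetical rendering: the data-structure names are ours, fact counting is reduced to counting visited nodes, and the no-prediction behavior when the limit is reached is omitted.

```python
from collections import deque

def pgpi_n(ni, k, max_facts, max_distance, nodes_attrs, links, attr_values):
    # nodes_attrs: {node: {attribute: value}}; links: {node: set of neighbors};
    # attr_values: possible values of attribute k (these encodings are ours).
    m = len(nodes_attrs[ni])                   # number of attributes
    scores = {v: 0.0 for v in attr_values}     # the map M of Algorithm 1
    dist = {ni: 0}                             # BFS distance from ni
    queue, seen, facts = deque([ni]), {ni}, 0
    while queue and facts < max_facts:
        nj = queue.popleft()
        if nj != ni:
            facts += 1
            # W(i,j): fraction of attribute values shared by ni and nj
            # (the unknown attribute k is excluded from the comparison)
            w = sum(1 for a, v in nodes_attrs[ni].items()
                    if a != k and nodes_attrs[nj].get(a) == v) / m
            value = nodes_attrs[nj].get(k)
            if value in scores:
                scores[value] += w / dist[nj]  # F(i,j) = W(i,j) / dist(ni, nj)
        if dist[nj] <= max_distance:
            for nh in links.get(nj, ()):       # expand the neighborhood
                if nh not in seen:
                    seen.add(nh)
                    dist[nh] = dist[nj] + 1
                    queue.append(nh)
    return max(scores, key=scores.get)         # argmax over the map M
```

On a toy graph where the target node's most similar neighbors share an attribute value, that value wins the weighted vote, as in the algorithm.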

4.2 Inferring user profiles using groups and publications

PGPI-G is a lazy algorithm designed to predict user attribute values using only group and publication information (views and likes). The algorithm is inspired by majority voting algorithms (e.g. [6]), which have been used for predicting user profiles based on links between users. PGPI-G adapts this idea to groups and publications, and also handles the constraint that at most maxFacts facts from the social graph can be accessed to make a prediction.

The PGPI-G algorithm (Algorithm 2) takes as parameters a node ni, an attribute k to be predicted, the maxFacts parameter, and an extended social graph E. It outputs a predicted value v for attribute k of node ni. The algorithm first initializes a map M so that it contains a key-value pair (v, 0) for each possible value v of attribute k. Then, the algorithm iterates over each member nj ≠ ni of each group g where ni is a member, while the number of accessed facts is less than maxFacts. For each such node nj, the formula Fgi,j is calculated to estimate the similarity between ni and nj and the similarity between ni and g. In this formula, commonLikes(ni, nj) and commonViews(ni, nj) respectively denote the number of publications liked and viewed by both ni and nj, while commonGroups(ni, nj) is the number of groups common to ni and nj. Finally, commonPopularAttributes(ni, g) is the number of attribute values of ni that are the same as the most popular attribute values among members of g. Then, Fgi,j is added to the entry in map M for the attribute value of nj for attribute k. Finally, the attribute value v associated with the largest value in M is returned as the predicted value for ni. Note that in our implementation, if the maxFacts limit is reached, the algorithm does not perform a prediction, to reduce the probability of making an inaccurate prediction.

Algorithm 2: The PGPI-G algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E: an extended social graph
output: the predicted attribute value v
1 M = {(v, 0) | v ∈ Vk};
2 foreach group g | (ni, g) ∈ NG s.t. |accessedFacts| < maxFacts do
3     foreach person nj ≠ ni ∈ g s.t. |accessedFacts| < maxFacts do
4         Fgi,j ← Wi,j × commonLikes(ni, nj) × commonViews(ni, nj) × (commonGroups(ni, nj) / |{(ni, x) | (ni, x) ∈ NG}|) ×
5             commonPopularAttributes(ni, g);
6         Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;
7     end
8 end
9 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
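The score Fgi,j can be sketched as follows. The helper names mirror the formula; the relation encodings (dictionaries of sets) are ours and not the paper's implementation.

```python
def fg_score(ni, nj, g, attrs, NG, LP, VP, popular_attrs):
    # Sketch of the score Fg(i,j) of Algorithm 2. Our encodings:
    # NG[n] = set of groups of node n, LP[n] = publications liked by n,
    # VP[n] = publications viewed by n, attrs[n] = {attribute: value},
    # popular_attrs[g] = most popular value of each attribute in group g.
    m = len(attrs[ni])
    # W(i,j): fraction of attribute values common to ni and nj
    w = sum(1 for a, v in attrs[ni].items() if attrs[nj].get(a) == v) / m
    common_likes = len(LP[ni] & LP[nj])
    common_views = len(VP[ni] & VP[nj])
    common_groups = len(NG[ni] & NG[nj])
    common_popular = sum(1 for a, v in attrs[ni].items()
                         if popular_attrs[g].get(a) == v)
    # Product form of the formula: any zero factor nullifies nj's vote
    return (w * common_likes * common_views
            * (common_groups / len(NG[ni])) * common_popular)
```

Note the product form: a pair of users with no common like or view contributes nothing to the vote, which is one reason PGPI-G needs richer activity data than PGPI-N.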


4.3 Inferring user profiles using nodes, links, groups and publications

The PGPI-N and PGPI-G algorithms have a similar design. Both are lazy algorithms that update a map M containing key-value pairs and then return the attribute value v associated with the highest value in M as the prediction. Because of this similarity, the algorithms PGPI-N and PGPI-G can easily be combined. We name this combination PGPI. The pseudocode of PGPI is shown in Algorithm 3; it is obtained by inserting lines 2 to 15 of PGPI-N before line 9 of PGPI-G. Moreover, a new parameter named ratioFacts is added. It specifies how many of the maxFacts facts can respectively be used by PGPI-G and by PGPI-N to make a prediction. For example, ratioFacts = 0.3 means that PGPI-G may use up to 30% of the facts, and thus that PGPI-N may use the other 70%. In our experiments, the best value for this parameter was 0.5.
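A minimal sketch of how this combination could be organized is given below. The function names are ours; the actual PGPI interleaves both phases over a single map M, as shown in Algorithm 3.

```python
def pgpi_fact_budgets(max_facts, ratio_facts):
    # Split the fact budget between the two phases of PGPI:
    # PGPI-G may access ratioFacts of maxFacts, PGPI-N the remainder.
    budget_g = int(max_facts * ratio_facts)
    return budget_g, max_facts - budget_g

def pgpi_merge_votes(scores_g, scores_n):
    # The full PGPI accumulates the votes of both phases in one map M
    # before taking the argmax (illustrative sketch, names are ours).
    merged = {v: scores_g.get(v, 0.0) + scores_n.get(v, 0.0)
              for v in set(scores_g) | set(scores_n)}
    return max(merged, key=merged.get)
```

With ratioFacts = 0.5 and maxFacts = 700, each phase would receive a budget of 350 facts, matching the setting used in the experiments below.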

5 Experimental Evaluation

We performed several experiments to assess the accuracy of the proposed PGPI-N, PGPI-G and PGPI algorithms for predicting attribute values of nodes in a social network. Experiments were performed on a computer with a fourth generation 64-bit Core i5 processor running Windows 8.1 and 8 GB of RAM. We compared the performance of the proposed algorithms with four state-of-the-art algorithms. The first three are Naïve Bayes classifiers [10]. Naïve Bayes (NB) infers user profiles strictly based on correlations between attribute values. Relational Naïve Bayes (RNB) considers the probability of having friends with specific attribute values. Collective Naïve Bayes (CNB) combines NB and RNB. To be able to compare NB, RNB and CNB with the proposed algorithms, we have adapted them to work with a partial graph. This is simply done by training them with maxFacts users chosen randomly instead of the full social graph. The last algorithm is label propagation (LP) [11]. Because LP requires the full social graph and does not consider the maxFacts parameter, its results are only used as a baseline. For algorithm-specific parameters, the values providing the best results were found empirically.

Experiments were carried out on a real-life dataset containing 11,247 user profiles collected from the Facebook social network in 2005 [23]. Each user is described according to seven attributes: a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year, and high school. These attributes respectively have 6, 2, 65, 66, 66, 15 and 1,420 possible values. The dataset is a social graph rather than an extended social graph, i.e., it does not contain information about group memberships, views and likes. But this information is needed by PGPI-G and PGPI. To address this issue, we have generated synthetic data about group memberships, views and likes. Our data generator takes several parameters as input. To find realistic values for these parameters, we have observed more than 100 real groups on the Facebook network. The parameters are the following. We set the number of groups to 500 and each group to contain between 20 and 250 members, since we observed that most groups are small. Groups are randomly generated, but similar users are more likely to be members of the same group. Furthermore, we limit the number of groups per person to 25. We generated 10,000 publications that represent content shared by users in groups, such as pictures, text and videos. A publication may appear in three to five groups. A user has a 50% probability of viewing a publication and a 30% probability of liking a publication that he has viewed. Lastly, a user can like and view respectively no more than 50 and 100 publications, and a user is more likely to like a publication that users with a similar profile also like.

To compare algorithms, the prediction of an attribute value is said to be a success if the prediction is accurate, a failure if the prediction is inaccurate, or otherwise a no match if the algorithm was unable to perform a prediction. Three performance measures are used. The accuracy is the number of successful predictions divided by the number of predictions performed (successes and failures).
The coverage is the number of successes or failures divided by the number of prediction opportunities. The product of accuracy and coverage (PAC) is used to assess the trade-off obtained between these two measures. In the following, accuracy and coverage are measured as the average over all seven attributes, except when indicated otherwise.

PAC w.r.t. number of facts. We ran all algorithms while varying the maxFacts parameter to assess the influence of the number of accessed facts on the product of prediction accuracy and coverage. Fig. I-1 shows the overall results. In this figure, the PAC is measured with respect to the number of accessed facts. It can be observed that PGPI provides the best results when 132 to 700 facts are accessed. For less than 132 facts, CNB provides the best results, followed by PGPI.
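For reference, the three evaluation measures defined earlier can be sketched as follows. This interprets accuracy as the fraction of performed predictions that succeed, so that PAC = accuracy × coverage captures the trade-off; the outcome label names are ours.

```python
def evaluate(outcomes):
    # outcomes: one label per prediction opportunity, among "success",
    # "failure" and "no_match" (hypothetical labels for illustration).
    n = len(outcomes)
    successes = outcomes.count("success")
    performed = successes + outcomes.count("failure")
    accuracy = successes / performed if performed else 0.0  # quality of made predictions
    coverage = performed / n   # fraction of opportunities where a prediction is made
    return accuracy, coverage, accuracy * coverage          # (accuracy, coverage, PAC)
```

For example, two successes, one failure and one no match over four opportunities yield an accuracy of 2/3, a coverage of 3/4, and a PAC of 0.5.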

PGPI-N provides the second best results for 226 to 306 facts. Note that no results are provided for PGPI-N for more than 306 facts. The reason is that PGPI-N relies solely on links between nodes to perform predictions, and the dataset does not contain enough links to let us further increase the number of facts accessed by this algorithm.

Figure I-1 Product of accuracy and coverage w.r.t. number of accessed facts

The algorithm providing the worst results is LP (not shown in the figures). LP provides a maximum accuracy and coverage of respectively 43.2% and 100%. This is not good considering that LP uses the full social graph of more than 10,000 nodes. Among the family of Naïve Bayes algorithms, CNB provides the best results until 90 facts are accessed. Then, NB is the best. RNB always provides the worst results among the Naïve Bayes algorithms.

Accuracy and coverage w.r.t. number of facts. The left and right parts of Fig. I-2 separately show the results for accuracy and coverage. In terms of accuracy, it can be observed that PGPI-N, PGPI-G and PGPI always provide much higher accuracy than NB, RNB, CNB and LP (from 9.2% to 36.8% higher accuracy). Thus, the reason why PGPI algorithms have a lower PAC than NB for 132 facts or less is their coverage, not their accuracy. The coverage is only shown in Fig. I-2 for PGPI-N, PGPI-G and PGPI, because the coverage of the other algorithms always remains the same (100% for NB, RNB and CNB, and 94.4% for LP). PGPI has a much higher coverage than PGPI-N and PGPI-G, even when it accesses far fewer facts. For example, PGPI has a coverage of 87% when it accesses 132 facts, while PGPI-N and PGPI-G only reach a similar coverage when using respectively 267 and 654 facts. The coverage of PGPI exceeds 95% when using about 400 facts.


Figure I-2 Accuracy and coverage w.r.t. number of accessed facts

Best results for each attribute. We also analyzed each attribute individually. The best results in terms of accuracy and coverage obtained for each attribute and each algorithm are shown in Table I-1. The last column of the table indicates the number of facts accessed to obtain these results. The best accuracy was achieved by PGPI algorithms for all attributes. For status, gender, minor, residence, year and school, PGPI algorithms respectively obtained a maximum accuracy of 90.9%, 94.6%, 75.1%, 60.5%, 70.5% and 12.9%. The reason for the low accuracy for school is that there are 1,420 possible values and that users are rarely linked to other users with the same value. The best results in terms of PAC are written in bold. PGPI provides the best PAC for all but one attribute. The second best algorithm is PGPI-N, which provides the second or third best PAC for five of the seven attributes. Moreover, it is interesting to see that PGPI-N achieves this result using less than half the number of facts used by PGPI. The third best algorithm is PGPI-G, which provides the second or third best PAC for four attributes.


Influence of the maxFacts and ratioFacts parameters. In the previous experiments, we have analyzed the relationship between the number of accessed facts and the PAC, accuracy and coverage. However, we did not discuss how to choose the maxFacts parameter and how it influences the number of accessed facts for each algorithm. For the Naïve Bayes algorithms and LP, the maxFacts parameter determines the number of nodes to be used for training, and thus directly determines the number of accessed facts. For the PGPI algorithms, we show the influence of maxFacts on the number of accessed facts in Fig. I-3. It can be observed that PGPI algorithms generally explore fewer nodes than maxFacts. For PGPI-N, the reason is that some social network users have fewer than maxFacts friends within maxDistance, so that PGPI-N cannot use maxFacts facts to perform a prediction. For PGPI-G, the reason is that some users are members of just a few groups. For PGPI, the number of accessed facts is greater than for PGPI-N but less than for PGPI-G. The reason is that PGPI combines both algorithms and uses the ratioFacts parameter, as previously explained, to decide how many facts can be used by PGPI-N and PGPI-G. The value that we used in the previous experiments for this parameter is 0.5, which means using 50% of the facts for PGPI-G, and thus the other half for PGPI-N.


Figure I-3 Number of accessed facts w.r.t. maxFacts

A question that can be raised is how to set the ratioFacts parameter of the PGPI algorithm.

In Fig. I-4, we show the influence of the ratioFacts parameter on the product of accuracy and coverage. We have measured the PAC w.r.t. the number of accessed facts when ratioFacts is respectively set to 0.3, 0.5 and 0.7. Note that setting ratioFacts to 0 or 1 would be respectively equivalent to using PGPI-N or PGPI-G. In Fig. I-4, it can be observed that setting ratioFacts to 0.5 yields the highest PAC. Furthermore, a value of 0.5 generally also gives a high PAC w.r.t. the number of accessed facts. A value of 0.3 sometimes gives a higher PAC than 0.5; however, its highest PAC is overall lower than that obtained with 0.5. A value of 0.7 almost always gives a lower PAC.


Figure I-4 Product of accuracy and coverage w.r.t. ratioFacts

Best results when using the full social graph. We also performed an experiment using the full social graph (maxFacts = ∞). The best algorithm in this case is PGPI, with an accuracy and coverage of 64.7% and 100%, while the second best algorithm (PGPI-G) achieves 62% and 98.4%.

Influence of the maxDistance parameter. Recall that the PGPI-N and PGPI algorithms utilize a parameter called maxDistance. This parameter indicates that two nodes are in the same neighborhood if there are no more than maxDistance edges in the shortest path between them. In the previous experiments, we have empirically set this parameter to obtain the best results. We now discuss the influence of this parameter on the results of PGPI-N and PGPI. We have investigated the influence of maxDistance on the product of accuracy and coverage, and also on the number of accessed facts. Results are shown in Fig. I-5 for PGPI-N. It can be observed that as maxDistance is increased, the PAC also increases until it reaches its peak at around maxDistance = 6. After that, it remains unchanged. The main reason why it remains unchanged is that the influence of a user on another user is divided by their distance in the PGPI-N algorithm. Thus, farther nodes have a very small influence on each other. Another interesting observation is that as maxDistance increases, the number of accessed facts also increases. Thus, if our goal is to use fewer facts, then maxDistance should be set to a small value, whereas if our goal is to achieve the maximum PAC, maxDistance should be set to a large value. Lastly, we can also observe that the number of accessed facts does not increase after maxDistance = 7. The reason is that the Facebook dataset is quite sparse (i.e., users have few friendship links).


Figure I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t. maxDistance for PGPI-N

Discussion. Overall, PGPI algorithms provide the best accuracy. A key reason is that they are lazy algorithms: they use local facts close to the node to be predicted rather than relying on a general training phase performed beforehand. A second important reason is that richer information is used by PGPI-G and PGPI (groups, publications). A third reason is that PGPI-N calculates the similarity between users, while label propagation does not. It is interesting to note that although PGPI-G relies on synthetic data in our experiment, PGPI-N does not and still performs very well.

6 Conclusion

We have proposed a lazy algorithm named PGPI that can accurately infer user profiles under the constraint of a partial social graph. It is, to our knowledge, the first algorithm that lets the user control the amount of information accessed from the social graph, and thereby the accuracy of predictions. Moreover, the algorithm is designed to use not only node links and profiles as a basis to perform predictions, but also group memberships, likes and views, if this information is available. An extensive experimental evaluation against four state-of-the-art algorithms shows that PGPI can provide considerably higher accuracy while accessing a much smaller number of nodes from the social graph to perform predictions. Moreover, an interesting result is that profile attributes such as status (student/professor) and gender can be predicted with more than 90% accuracy using PGPI. For future work, we plan to further improve the PGPI algorithms by proposing new optimizations and to perform experiments on additional datasets.


Algorithm 3: The PGPI algorithm
input : ni: a node, k: the attribute to be predicted, maxFacts and maxDistance: the user-defined thresholds, E: an extended social graph, ratioFacts: the ratio of facts to be used by PGPI-G
output: the predicted attribute value v
1  M = {(v, 0) | v ∈ Vk};
2  foreach group g | (ni, g) ∈ NG s.t. |accessedFacts| < maxFacts × ratioFacts do
3      foreach person nj ≠ ni ∈ g s.t. |accessedFacts| < maxFacts do
4          Fgi,j ← Wi,j × commonLikes(ni, nj) × commonViews(ni, nj) × (commonGroups(ni, nj) / |{(ni, x) | (ni, x) ∈ NG}|) ×
5              commonPopularAttributes(ni, g);
6          Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;
7      end
8  end
9  Initialize a queue Q;
10 Q.push(ni);
11 seen = {ni};
12 while Q is not empty and |accessedFacts| < maxFacts do
13     nj = Q.pop();
14     Fi,j ← Wi,j / dist(ni, nj);
15     Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;
16     if dist(ni, nj) ≤ maxDistance then
17         foreach node nh ≠ ni such that (nh, nj) ∈ L and nh ∉ seen do
18             Q.push(nh);
19             seen ← seen ∪ {nh};
20         end
21     end
22 end
23 return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;


algorithm    | status | gender | major | minor | residence | year   | school | |facts|
PGPI acc.    | 90.8%  | 87.7%  | 31.5% | 75.1% | 59.59%    | 66.86% | 10.3%  | 684
PGPI cov.    | 100%   | 98.9%  | 99%   | 99%   | 99%       | 99%    | 99%    | 684
PGPI-N acc.  | 89.4%  | 90.9%  | 24.5% | 73.9% | 60.5%     | 66.77% | 12.9%  | 272
PGPI-N cov.  | 94.4%  | 94.0%  | 94.4% | 94.4% | 94.4%     | 94.4%  | 94.4%  | 272
PGPI-G acc.  | 90.9%  | 94.6%  | 35.2% | 70.8% | 53.0%     | 70.5%  | 8.8%   | 689
PGPI-G cov.  | 96.6%  | 72%    | 72%   | 72%   | 72%       | 72%    | 72%    | 689
NB acc.      | 88.0%  | 51.1%  | 16.6% | 74.4% | 55.6%     | 26.0%  | 10.6%  | 189
NB cov.      | 100%   | 100%   | 100%  | 100%  | 100%      | 100%   | 100%   | 189
RNB acc.     | 80.2%  | 57.7%  | 15.0% | 74.2% | 55.0%     | 26.2%  | 9.8%   | 580
RNB cov.     | 100%   | 100%   | 100%  | 100%  | 100%      | 100%   | 100%   | 580
CNB acc.     | 88.0%  | 47.3%  | 9.0%  | 74.0% | 54.8%     | 26.0%  | 10.6%  | 189
CNB cov.     | 100%   | 100%   | 100%  | 100%  | 100%      | 100%   | 100%   | 189
LP acc.      | 83.0%  | 50.4%  | 16.7% | 56.3% | 49.1%     | 35.3%  | 10.1%  | 10K
LP cov.      | 94.4%  | 94.4%  | 94.4% | 94.4% | 94.4%     | 94.4%  | 94.4%  | 10K

Table I-1 Best accuracy/coverage results for each attribute

CHAPITRE II

PGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE CONSTRAINT OF A PARTIAL SOCIAL GRAPH

PGPI+: ALGORITHME AMÉLIORÉ D'INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉSEAUX SOCIAUX À PARTIR D'UN GRAPHE SOCIAL PARTIEL

Abstract

Algorithms for social network user profiling suffer from one or more of the following limitations: (1) assuming that the full social graph is available for training, (2) not exploiting the rich information that is available in social networks, such as group memberships and likes, (3) treating numeric attributes as nominal attributes, and (4) not assessing the certainty of their predictions. In this paper, we address these challenges by proposing an improved algorithm named PGPI+ (Partial Graph Profile Inference+). PGPI+ accurately infers user profiles under the constraint of a partial social graph using rich information about users (e.g. group memberships, views and likes), handles nominal and numeric attributes, and assesses the certainty of predictions. An experimental evaluation with more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI+ predicts user profiles with considerably more accuracy and by accessing a smaller part of the social graph than five state-of-the-art algorithms.

Keywords: social networks, inference, user profiles, partial graph

1 Introduction

Online social networks have become extremely popular. Various types of social networks are used, such as friendship networks (e.g. Facebook), professional networks (e.g. ResearchGate) and interest-based networks (e.g. Flickr). An important problem for ad targeting [25] and for building recommender systems [5, 17, 19, 24, 27] that rely on social networks is that users often disclose little information publicly [1, 12]. To address this issue, an important sub-field of social network mining is now interested in developing algorithms to infer detailed user profiles using publicly disclosed information [2, 21]. Various approaches have been used to solve this problem, such as relational Naïve Bayes classifiers [14], label propagation [11, 13], majority voting [6], linear regression [12], Latent Dirichlet Allocation [3] and community detection [16]. It was shown that these approaches can accurately predict hidden attributes of user profiles in many cases. However, all these approaches suffer from at least two of the following four limitations.

1. Assuming a full social graph. Many approaches assume that the full social graph is available for training (e.g. [11]). However, in real life, it is generally unavailable or may be very costly to obtain or update [3, 8]. A few approaches do not assume a full social graph, such as majority voting [6]. However, they do not let the user control the trade-off between the number of nodes accessed and prediction accuracy, which may lead to low accuracy.

2. Not using rich information. Several algorithms do not consider the rich information that is available on social networks. For example, several algorithms consider links between users and user attributes, but do not consider other information such as group memberships, "likes" and "views" that are available on some social networks [1, 6, 11, 13, 14, 16].

3. Not handling numeric attributes. Many approaches treat numeric attributes (e.g. age) as nominal attributes [1, 6], which may decrease inference accuracy. Others are designed to handle numeric attributes but require the full social graph, which is often impractical [7, 11, 12, 16].

4. Not assessing certainty. Few approaches assess the certainty of their predictions. But this information is essential to determine whether a prediction is reliable and whether actions should be taken based on it. For example, if there is a low certainty that a user profile attribute is correctly inferred, it may be better not to use this attribute for ad targeting, rather than showing an ad that is targeted to a different audience [4].

To address limitations 1 and 2, the PGPI algorithm (Partial Graph Profile Inference) was recently proposed [8]. PGPI lets the user select how many nodes of the social graph can be accessed to infer a user profile, and can use not only information about friendship links and profiles, but also about group memberships, likes and views, when available. In this paper, we present an extended version of PGPI named PGPI+ that also addresses limitations 3 and 4. Contributions are threefold.
First, we design a new procedure for predicting values of numeric attributes. Second, we introduce a mechanism to assess the certainty of predictions for both nominal and numeric attributes. Third, we introduce four optimizations that considerably improve the overall prediction accuracy of PGPI. We report results from an extensive empirical evaluation against PGPI and four other state-of-the-art algorithms, for more than 30,000 user profiles from the Facebook and Pokec social networks. Results show that the proposed PGPI+ algorithm can provide a considerably higher accuracy for both numeric and nominal attributes while accessing a much smaller number of nodes from the social graph. Moreover, results show that the calculated certainty is a good indicator of the reliability of predictions. Also, an interesting result is that profile attributes such as status (student/professor) and gender can be predicted with more than 95% accuracy using PGPI+. The rest of this paper is organized as follows. Sections 2, 3, 4, 5 and 6 respectively present the related work, the problem definition, the proposed algorithm, the experimental evaluation and the conclusion.

2 Related work

We review recent work on social network user profile inference. Davis Jr et al. [6] inferred

locations of Twitter users by performing a majority vote over the locations of directly connected users. A major limitation of this approach is that a single attribute is considered. Jurgens [11] predicted locations of Twitter/Foursquare users using a label propagation approach. However, it is an iterative algorithm that requires the full social graph, since it propagates known labels to unlabeled nodes through links between nodes. Li et al. [13] proposed an iterative algorithm to deduce LinkedIn user profiles based on relation types. This algorithm also requires a large training set to discover relation types. Mislove [16] applied community detection to more than 60k user profiles with friendship links, and then inferred user profiles based on the similarity of members from the same community. Lindamood et al. [14] applied a Naïve Bayes classifier to 167k Facebook profiles with friendship links, and concluded that if links or attributes are erased, the accuracy of the approach can greatly decrease. This study highlights the challenges of performing accurate predictions using little data. Recently, Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to infer user profiles in a dataset of 3 million Hyves.nl users. However, all these works assume that a large training set is available, and they only use profile information and social links to perform predictions.

Chaabane et al. [3] inferred Facebook user profiles using Latent Dirichlet Allocation (LDA) and majority voting. The approach extracts a probabilistic model from music interests and additional information provided from Wikipedia, but it requires a large training set, which was difficult and time-consuming to obtain [3]. Kosinski et al. [12] also utilized information about user preferences to infer Facebook user profiles. They applied Singular Value Decomposition to a huge matrix of users/likes and then used regression to perform predictions. A limitation of this work is that it does not utilize information about links between users and requires a very large training dataset. He et al. [9] proposed an approach consisting of building a Bayesian network based on the full social graph to then predict user attribute values. The approach considers the similarity between user profiles and the links between users to perform predictions, and was applied to data collected from LiveJournal. Recently, Dong et al. [7] used graphical models to predict the age and gender of users. Their study was performed with 1 billion phone and SMS records and 7M user profiles. Chaudhari [4] also used graphical models to infer user profiles. The approach has shown high accuracy on datasets of more than 1M users from the Twitter and Pokec social networks. A limitation of these approaches, however, is that they assume a large training set. Furthermore, they only consider user attributes and links, but not additional information such as likes, views and group memberships.

Some of the above approaches infer numeric attributes, either by treating them as nominal attributes [1, 3, 6] or by using specific inference procedures. However, the latter require a large training set [7, 11, 12, 16]. Besides, current approaches generally do not assess the certainty of predictions. But this information is essential to determine whether a prediction is reliable and whether actions should be taken based on it. To our knowledge, only Chaudhari [4] provides this information. However, this approach is designed to use the full social graph.

3 Problem definition

The problem of user profiling is commonly defined as follows [1, 4, 11, 13, 14, 16].

Definition 1. Social graph. A social graph G is a quadruplet G = {N, L, V, A}. N is the set

of nodes in G. L ⊆ N × N is a binary relation representing the links (edges) between nodes. Let there be m attributes describing the users of the social network, such that V = {V1, V2, ...Vm} contains, for each attribute i, the set of its possible values Vi. Finally, A = {A1, A2, ...Am} contains, for each attribute i, a relation assigning an attribute value to nodes, that is Ai ⊆ N × Vi.

Example. Consider a social graph with three nodes N = {Tom, Amy, Lea} and friendship links L = {(Tom, Lea), (Lea, Tom), (Lea, Amy), (Amy, Lea)}. Consider two attributes, gender and status, respectively called attributes 1 and 2, used to describe users. The sets of possible attribute values for gender and status are respectively V1 = {male, female} and V2 = {professor, student}. The relations assigning attribute values to nodes are A1 = {(Tom, male), (Amy, female), (Lea, female)} and A2 = {(Tom, student), (Amy, student), (Lea, professor)}.

Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring the user profile of a node n ∈ N in a social graph G is to guess the attribute values of n using the other information provided in G. The problem definition can be extended to consider additional information from social networks such as Facebook (views, likes and group memberships).

Definition 3. Extended social graph. An extended social graph E is a tuple E = {N, L, V, A, G, NG, P, PG, LP, VP}, where N, L, V, A are defined as previously. G is a set of groups that a user can be a member of. The relation NG ⊆ N × G indicates the membership of users to groups. P is a set of publications such as pictures, texts and videos that are posted in groups. PG is a relation PG ⊆ P × G, which associates a publication to the group(s) where it was posted. LP is a relation LP ⊆ N × P indicating the publication(s) liked by each user (e.g. "likes" on Facebook). VP is a relation VP ⊆ N × P indicating the publication(s) viewed by each user (e.g. "views" on Facebook), such that LP ⊆ VP.

Example.
Consider two groups G = {book_club, music_lovers} such that NG = {(Tom, book_club), (Lea, book_club), (Amy, music_lovers)}, and two publications P = {picture1, picture2} published in the groups, with PG = {(picture1, book_club), (picture2, music_lovers)}. The publications viewed by users are VP = {(Tom, picture1), (Lea, picture1), (Amy, picture2)}, while the publications liked by users are LP = {(Tom, picture1), (Amy, picture2)}.
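For concreteness, the running example above can be written down as plain data structures. This is only a sketch; the encoding (sets of tuples, attributes indexed by number) is our choice and not part of the definitions:

```python
# The running example as plain Python data structures (Definitions 1 and 3).
N = {"Tom", "Amy", "Lea"}
L = {("Tom", "Lea"), ("Lea", "Tom"), ("Lea", "Amy"), ("Amy", "Lea")}
V = {1: {"male", "female"}, 2: {"professor", "student"}}
A = {1: {"Tom": "male", "Amy": "female", "Lea": "female"},
     2: {"Tom": "student", "Amy": "student", "Lea": "professor"}}
G = {"book_club", "music_lovers"}
NG = {("Tom", "book_club"), ("Lea", "book_club"), ("Amy", "music_lovers")}
P = {"picture1", "picture2"}
PG = {("picture1", "book_club"), ("picture2", "music_lovers")}
VP = {("Tom", "picture1"), ("Lea", "picture1"), ("Amy", "picture2")}
LP = {("Tom", "picture1"), ("Amy", "picture2")}

# Sanity checks from the definitions: friendship links are symmetric
# and every liked publication was also viewed (LP ⊆ VP).
assert all((b, a) in L for (a, b) in L)
assert LP <= VP
```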


Definition 4. Problem of inferring user profiles in an extended social graph. The problem of inferring the user profile of a node n ∈ N in an extended social graph E is to guess the attribute values of n using the information in E. The above definitions, however, assume that the full social graph may be used to perform predictions. The problem of inferring user profiles using a limited amount of information is defined as follows [8].

Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or publication from N, G or P (excluding n).

The above definition can be extended for numeric attributes. For those attributes, instead of aiming at predicting an exact attribute value, the goal is to predict a value that is as close as possible to the real value. Moreover, in this paper, we also extend the problem to consider the certainty of predictions. In this setting, a prediction algorithm must assign a certainty value in the [0,1] interval to each predicted value, such that a high certainty value indicates that a prediction is likely to be correct.
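The maxFacts constraint can be pictured as an access-counting wrapper around the graph. The sketch below is illustrative only (the class and method names are ours); it counts each distinct fact once and refuses further accesses once the budget is exhausted:

```python
class PartialGraphAccess:
    """Sketch: counts each distinct fact (node, group or publication)
    accessed while inferring the profile of `target` (Definition 5)."""

    def __init__(self, target, max_facts):
        self.target = target
        self.max_facts = max_facts
        self.accessed = set()

    def access(self, fact):
        """Record an access; return False once the budget is exhausted."""
        if fact == self.target:
            return True                  # the target node itself is not counted
        if fact not in self.accessed and len(self.accessed) >= self.max_facts:
            return False                 # budget exhausted: stop exploring
        self.accessed.add(fact)
        return True

budget = PartialGraphAccess("Amy", max_facts=2)
assert budget.access("Lea") and budget.access("Tom")
assert not budget.access("book_club")    # third distinct fact is refused
```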

4 The proposed PGPI+ algorithm

We next present the proposed PGPI+ algorithm. Subsection 4.1 briefly introduces PGPI. Then,

subsections 4.2, 4.3 and 4.4 respectively present optimizations to improve its prediction accuracy and coverage, its extension to handle numerical attributes, and its extension to assess the certainty of predictions.

4.1 The PGPI algorithm

The PGPI algorithm [8] is a lazy algorithm designed to perform predictions under the constraint of a partial social graph, where at most maxFacts facts from the social graph can be accessed to make a prediction. PGPI (Algorithm 4) takes as parameters a node ni, an attribute k to be predicted, the maxFacts parameter, a parameter named maxDistance, and an (extended) social graph E. PGPI outputs a predicted value v for attribute k of node ni. To predict the value of an attribute k, PGPI relies on a map M. This map stores pairs of the form (v, f), where v is a possible value for attribute k, and f is a positive real number called the weight of v. PGPI calculates the weights by applying two procedures named PGPI-G and PGPI-N, which respectively update the weights by considering (1) the views, likes and group memberships of ni, and (2) its friendship links. After applying these procedures, PGPI returns the value v associated with the highest weight in M as the prediction. In PGPI, half of the maxFacts facts that can be used to make a prediction are used by PGPI-G and the other half by PGPI-N. If the global maxFacts limit is reached, PGPI does not perform a prediction. PGPI-N or PGPI-G can be deactivated. If PGPI-N is deactivated, only views, likes and group memberships are considered to make a prediction. If PGPI-G is deactivated, only friendship links are considered. In the following, we respectively refer to these versions of PGPI as PGPI-N and PGPI-G (and as PGPI-N+/PGPI-G+ for PGPI+).

PGPI-N works as follows. To predict an attribute value of a node ni, it explores the neighborhood of ni, restricted by the parameter maxDistance, using a breadth-first search. It first initializes a queue Q and pushes ni into the queue. Then, while Q is not empty and the number of accessed facts is less than maxFacts, the first node nj in Q is popped. Then, Fi,j = Wi,j / dist(ni, nj) is calculated, where Wi,j = Ci,j / Ci, Ci,j is the number of attribute values common to ni and nj, Ci is the number of known attribute values for node ni, and dist(ni, nj) is the number of edges in the shortest path between ni and nj. Then, Fi,j is added to the weight of the attribute value of nj for attribute k in map M. Then, if dist(ni, nj) ≤ maxDistance, each unvisited node nh linked to nj is pushed into Q and marked as visited.

PGPI-G is similar to PGPI-N. It is also a lazy algorithm, but it uses a majority-voting approach to update weights based on group and publication information (views and likes). Due to space limitation, we do not describe it; the reader may refer to [8] for more details.


Algorithm 4: The PGPI algorithm
input: ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, maxDistance: a user-defined threshold, E: an extended social graph
output: the predicted attribute value v

1.  M = {(v, 0) | v ∈ Vk};
2.  // Apply PGPI-G
3.  // ...
4.  // Apply PGPI-N
5.  Initialize a queue Q and add ni to Q;
6.  while Q is not empty and |accessedFacts| < maxFacts do
7.      nj = Q.pop();
8.      Fi,j ← Wi,j / dist(ni, nj);
9.      Update (v, f) as (v, f + Fi,j) in M, where (nj, v) ∈ Ak;
10.     if dist(ni, nj) ≤ maxDistance then push each unvisited node nh ≠ ni such that (nh, nj) ∈ L in Q and mark nh as visited;
11. end
12. return a value v such that (v, z) ∈ M ∧ ∄(v′, z′) ∈ M | z′ > z;
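For illustration, the PGPI-N part of Algorithm 4 (lines 5 to 12) can be sketched in Python as follows. This is a simplified sketch: the function and data-structure names are ours, the PGPI-G half is omitted, and the fact budget is counted locally rather than shared between the two procedures:

```python
from collections import deque

def pgpi_n(ni, k, max_facts, max_distance, links, attrs):
    """Sketch of PGPI-N (Algorithm 4, lines 5-12).
    links: node -> set of neighbours; attrs: node -> {attribute: value}."""
    M = {}                        # candidate value -> accumulated weight
    dist = {ni: 0}                # BFS distance from ni, doubles as "visited"
    Q = deque([ni])
    accessed = 0                  # facts accessed so far (ni itself not counted)
    while Q and accessed < max_facts:
        nj = Q.popleft()
        if nj != ni:
            accessed += 1
            # W(i,j) = C(i,j) / C(i): fraction of ni's known values shared by nj
            known = {a: v for a, v in attrs.get(ni, {}).items() if a != k}
            common = sum(1 for a, v in known.items()
                         if attrs.get(nj, {}).get(a) == v)
            w = common / len(known) if known else 0.0
            value = attrs.get(nj, {}).get(k)
            if value is not None:
                M[value] = M.get(value, 0.0) + w / dist[nj]
        if dist[nj] <= max_distance:
            for nh in links.get(nj, ()):
                if nh not in dist:
                    dist[nh] = dist[nj] + 1
                    Q.append(nh)
    return max(M, key=M.get) if M else None
```

On the running example, predicting Amy's status gives "professor", since Lea (who shares Amy's gender value) is her only direct neighbor with a known status and a non-zero weight.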

4.2 Optimizations to improve accuracy and coverage

In PGPI+, we redefine the formula Fi,j used by PGPI-N by adding three optimizations. The new formula is Fi,j = (Wi,j × (Ti,j + 1) × R) / newdist(ni, nj). The first optimization is to add the term Ti,j + 1, where Ti,j is the number of common friends between ni and nj divided by the number of friends of ni. This term is added to consider that two persons having common friends (forming a triad) are more likely to have similar attribute values. The constant 1 is used so that Fi,j is not zero when ni and nj have no common friends. The second optimization is based on the observation that the term dist(ni, nj) makes Fi,j decrease too rapidly. Thus, nodes that are not immediate neighbors, but are still close in the social graph, had a negligible influence on their respective profile inference. To address this issue, dist(ni, nj) is replaced by newdist(ni, nj) = 3 − (0.2 × dist(ni, nj)), where it is assumed that maxDistance < 15. It was empirically found that this formula provides higher accuracy. The third optimization is based on the observation that PGPI-G had too much influence on predictions compared to PGPI-N. To address this issue, we multiply the weights calculated using the formula Fi,j by a new constant R. This increases the influence of PGPI-N+ on the choice of predicted values. In our experiments, we found that setting R to 10 provides the best accuracy. Furthermore, a fourth optimization is integrated in the main procedure of PGPI+. It is based on the observation that PGPI does not make a prediction for up to 50% of users when maxFacts is set to a small value [8]. The reason is that PGPI does not make a prediction when it reaches the maxFacts limit, even though it may have collected enough information to make an accurate prediction. In PGPI+, a prediction is always performed. This optimization was shown to greatly increase the number of predictions.

4.3 Extension to handle numerical attributes

The PGPI algorithm is designed to handle nominal attributes. In PGPI+, we performed the following modifications to handle numeric attributes. First, we modified how the predicted value is chosen. Recall that the value predicted by PGPI for nominal attributes is the one having the highest weight in M (line 12). However, for numeric attributes, this approach provides poor accuracy because few users have exactly the same attribute value. For example, for the attribute "weight", few users may have the same weight, although they may have similar weights. To address this issue, PGPI+ calculates the predicted value for a numeric attribute as the weighted sum of all values in M. Second, we adapted the weighted sum so that it ignores outliers, because if unusually large values are in M, the weighted sum provides inaccurate predictions. For example, if a young user has friendship links to a few 20-year-old friends but also a link to his 90-year-old grandmother, the prediction may be inaccurate. Our solution to this problem is to ignore values in M that are more than one standard deviation away from the mean. In our experiments, this greatly improves prediction accuracy for numeric attributes.
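A sketch of this prediction procedure, reading the "weighted sum" as a weight-normalized average and dropping values more than one standard deviation from the mean (the function name and the handling of edge cases are ours):

```python
from statistics import mean, stdev

def predict_numeric(M):
    """Sketch: predict a numeric attribute from map M = {value: weight}.
    Values more than one standard deviation from the mean are ignored,
    then a weighted average of the remaining values is returned."""
    values = list(M)
    if len(values) == 1:
        return values[0]
    mu, sigma = mean(values), stdev(values)
    kept = {v: w for v, w in M.items() if abs(v - mu) <= sigma}
    total = sum(kept.values())
    if total == 0:
        return mu
    return sum(v * w for v, w in kept.items()) / total

# The grandmother example from the text: ages 20, 21, 22 plus an
# outlier of 90; the outlier is discarded before averaging.
```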


Third, we changed how Wi,j is calculated. Recall that in PGPI, Wi,j = Ci,j / Ci, where Ci,j is the number of attribute values common to ni and nj, and Ci is the number of known attribute values for node ni. This definition does not work well for numeric attributes because numeric attributes rarely have exactly the same value. To consider that numeric values may not be equal but still be close, Ci,j is redefined as follows in PGPI+. The value Ci,j is the number of values common to ni and nj for nominal attributes, plus a value CNi,j,k for each numeric attribute k. The value CNi,j,k is calculated as 1 − (|vi − vj| / αk) if |vi − vj| < αk, and is otherwise 0, where vi and vj are the values of attribute k for ni and nj, and αk is a user-defined constant. Because CNi,j,k is a value in [0,1], numeric attributes may not have more influence than nominal attributes on Wi,j.
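The contribution of one numeric attribute to Ci,j can be sketched as follows, assuming the similarity decreases linearly from 1 (equal values) to 0 (difference of αk or more); the function name is ours:

```python
def cn(vi, vj, alpha):
    """Sketch of the numeric contribution CN(i,j,k) to C(i,j):
    close values count almost as much as an exact nominal match,
    values further apart than alpha count for nothing."""
    d = abs(vi - vj)
    return 1 - d / alpha if d < alpha else 0.0
```

For example, with αk = 10 for the age attribute, two users aged 20 and 22 contribute 0.8 to Ci,j, while users aged 20 and 35 contribute nothing.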

4.4 Extension to evaluate the certainty of predictions

PGPI+ also extends PGPI with the capability of calculating a certainty value CV(v) for each predicted value v. As previously mentioned, few inference algorithms assess the certainty of their predictions. However, this feature is desirable to know how good a prediction is and whether or not actions should be taken based on it. For example, in the context of ad targeting, if there is a low certainty that a user profile attribute is correctly inferred, it may be better not to use this attribute for ad targeting, rather than showing an ad that is targeted to a different audience [4]. A certainty value is a value in the [0,1] interval; a high certainty value indicates that a prediction is likely to be correct. Certainty values are calculated differently for numeric and nominal attributes because different inference procedures are used to predict their values.

Certainty of numeric values. Recall that values of numeric attributes are predicted using a weighted sum. Thus, calculating the certainty value of a predicted value v for a numeric attribute k requires a way to assess how "accurate" the weighted sum used to calculate v is. Intuitively, we can expect the weighted sum to be accurate if (1) the amount of information taken into account by the weighted sum is large, and (2) the values considered in the weighted sum are close to each other. These ideas are captured in our approach by using the relative standard error. Let EM = {v1, v2, ...vp} be the set of values stored in the map M that were used to calculate the weighted sum. The amount of information used by the weighted sum is measured as the number of updates made by PGPI-N+/PGPI-G+ to the map M, denoted as updates. The relative standard error is defined as RSE(v) = stdev(EM) / (√updates × avg(EM)). The RSE is a value in the [0,1] interval that assesses how close the average of a sample might be to the average of the population. Because we want a certainty value rather than an error value, we calculate the certainty value of v as CV(v) = 1 − RSE(v) = 1 − stdev(EM) / (√updates × avg(EM)). The proposed approach provides an assessment of the certainty of a prediction. However, a drawback of this RSE-based approach is that it is sensitive to outliers. To address this issue, we ignore values that are more than one standard deviation away from the mean when calculating CV(v).

Certainty of nominal values. The procedure for calculating the certainty of a predicted value v for a nominal attribute is different than for numeric attributes, since the inference procedure is not the same. Our idea is to evaluate how likely the weight z of value v in M is to be as large as it is, compared to the other weights in M. To estimate this, we use a simulation-based approach where "larger" is defined in terms of a number of standard deviations away from the mean. The procedure is shown in Algorithm 5. It takes as input a value v for an attribute k and the map of values/weights M. Let ZM = {z1, z2, ...zp} denote the weights of values in the map M, and f(M, v) be the weight of v in map M. We initialize a variable count to 0 and perform 1,000 simulations.
During the j-th simulation, the algorithm creates a new map B containing a key-value pair (v, 0) for each possible value v of attribute k in map M; these weights are initialized to zero. Then, the algorithm performs updates random updates to B. A random update is performed as follows. First, a pair (v′, z′) is randomly chosen from B. Then, a value x is randomly selected from the [0,1] interval. Finally, (v′, z′) is replaced by (v′, z′ + x) in B. At the end of the j-th simulation, the count variable is increased by 1 if (f(B, v) − avg(ZB)) / stdev(ZB) ≥ (f(M, v) − avg(ZM)) / stdev(ZM), where ZB denotes the weights of values in B. After performing the 1,000 simulations, the certainty value is calculated as CV(v) = count/1000, which gives a value in the [0,1] interval. This value CV(v) can be interpreted as the chance that the weight of v in map M is as far as it is, in terms of standard deviations, from the average weight of values in M for the attribute k to be predicted. Thus, by the above mechanisms, certainty values can be provided for predicted values of both numeric and nominal attributes.

Algorithm 5: Certainty assessment of a nominal attribute value
input: M: the map of values/weights, v: an attribute value, k: the attribute, updates: the number of updates to M
output: the certainty value CV(v)

1.  count = 0;
2.  for 1 to 1000 do
3.      B = {(v, 0) | (v, z) ∈ M};
4.      for 1 to updates do
5.          Randomly choose a pair (v′, z′) from B;
6.          x = random([0, 1]);
7.          Replace (v′, z′) by (v′, z′ + x) in B;
8.      end
9.      if (f(B, v) − avg(ZB)) / stdev(ZB) ≥ (f(M, v) − avg(ZM)) / stdev(ZM) then
10.         count = count + 1;
11.     end
12. end
13. CV(v) = count/1000;
14. return CV(v);
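Both certainty computations can be sketched as follows. The function names are ours; the nominal procedure follows Algorithm 5 as written (including its ≥ comparison), and the numeric one applies the CV(v) = 1 − RSE formula directly, leaving out the outlier filtering:

```python
import random
from statistics import mean, stdev

def certainty_numeric(values, updates):
    """CV(v) = 1 - RSE = 1 - stdev(EM) / (sqrt(updates) * avg(EM))."""
    return 1 - stdev(values) / (updates ** 0.5 * mean(values))

def certainty_nominal(M, v, updates, sims=1000, rng=random):
    """Monte-Carlo sketch of Algorithm 5: how often do `updates` random
    weight increments give v a z-score at least as large as observed?"""
    zM = list(M.values())
    observed = (M[v] - mean(zM)) / stdev(zM)
    count = 0
    for _ in range(sims):
        B = dict.fromkeys(M, 0.0)            # same keys as M, weights reset
        for _ in range(updates):
            key = rng.choice(list(B))        # pick a random pair (v', z')
            B[key] += rng.random()           # add x drawn from [0, 1]
        zB = list(B.values())
        if stdev(zB) > 0 and (B[v] - mean(zB)) / stdev(zB) >= observed:
            count += 1
    return count / sims
```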


5 Experimental evaluation

We compared the accuracy of the proposed PGPI+, PGPI-N+ and PGPI-G+ algorithms with the

original PGPI, PGPI-N and PGPI-G algorithms, and with four additional state-of-the-art algorithms for predicting attribute values of nodes in a social network. The first three are Naïve Bayes classifiers [10]. Naïve Bayes (NB) infers user profiles strictly based on correlations between attribute values. Relational Naïve Bayes (RNB) considers the probability of having friends with specific attribute values. Collective Naïve Bayes (CNB) combines NB and RNB. To be able to compare NB, RNB and CNB with the proposed algorithms, we adapted them to work with a partial graph, by training them with maxFacts users chosen randomly instead of the full social graph. The last algorithm is label propagation (LP) [11]. Because LP requires the full social graph, its results are only used as a baseline. Each algorithm was tuned with optimal parameter values.

Datasets. Two datasets are used. The first one consists of 11,247 user profiles collected from Facebook in 2005 [23]. Each user is described according to seven attributes: a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year, and high school, where year is a numeric attribute. The second dataset consists of 20,000 user profiles from the Pokec social network, obtained at https://snap.stanford.edu/data/. It contains 17 attributes, including three numeric attributes: age, weight and height. Because the two datasets do not contain information about groups, and this information is needed by PGPI-G and PGPI, synthetic data about groups was generated using the generator proposed in [8], with the same parameters. This generator is designed to generate groups having characteristics similar to real-life groups.

Accuracy for nominal attributes w.r.t. number of facts. We first ran all algorithms while varying the maxFacts parameter to assess the influence of the number of accessed facts on accuracy for nominal attributes.
The accuracy for nominal attributes is defined as the number of correctly predicted values divided by the number of prediction opportunities. Fig. II-1 shows the overall results for the Facebook and Pokec datasets. Note that the PGPI algorithms are not shown in this figure due to lack of space. It can be observed that PGPI+/PGPI-N+/PGPI-G+ provide the best results. For example, on Facebook, PGPI+ and PGPI-G+ provide the best results when 66 to 700 facts are accessed, and for less than 66 facts, PGPI-N+ provides the best results, followed by PGPI+. No results are provided for PGPI-N+ for more than 306/6 facts on Facebook/Pokec, because PGPI-N+ relies solely on links between nodes to perform predictions and the datasets do not contain enough links. It is also interesting to note that PGPI-N+ only uses real data (contrary to PGPI+/PGPI-G+) and still performs better than all other algorithms. The algorithm providing the worst results is LP (not shown in the figure). LP provides an accuracy of 43.2%/47.31% on Facebook/Pokec. This is not good considering that LP uses the full social graph of more than 10,000 nodes. Among the family of Naïve Bayes algorithms, NB has the best overall accuracy. It can be further observed that the accuracy of the PGPI+ algorithm is up to 34% higher than that of PGPI, which shows that the proposed optimizations have a major influence on accuracy.

Figure II-1 Accuracy w.r.t. number of accessed facts for nominal attributes (Facebook and Pokec)

Best results for each nominal attribute. To better understand the previous results, a second experiment was performed where we analyzed the accuracy obtained for each nominal attribute separately. The goal of this analysis is to see how well each attribute can be predicted, and also whether some approaches perform better than others for specific attributes. The best results in terms of accuracy for each attribute and algorithm for the Facebook and

Pokec datasets are respectively shown in Tables II-1 and II-2. The last row of each table indicates the number of accessed facts needed to obtain these results. Note that in these tables, the results for the PGPI algorithms are not shown because they are always worse than those of the optimized PGPI+ algorithms. Overall, it can be observed that the best accuracy is in general achieved by the PGPI+ algorithms for all attributes. In several cases, the PGPI+ algorithms provide an accuracy that is more than 10% higher than the compared state-of-the-art algorithms, and the gap can be greater than 30% in some cases. A second important observation is that the PGPI+ algorithms can predict some attributes with a very high accuracy. For example, the gender of a Facebook user can be predicted with up to 95% accuracy by PGPI+. Another example is that PGPI-N+ can guess with up to 94% accuracy whether a Pokec user speaks a language such as Spanish or Italian. The high accuracy that is obtained may be very useful for real applications of user profiling such as ad targeting. A third observation is that some attributes are more difficult to predict. For example, most algorithms have a relatively low accuracy for the school attribute on the Facebook dataset. A reason is that, unlike binary attributes such as gender, the school attribute has a very high number of possible values (1,420 values, each corresponding to a possible high school attended by a user). Another reason is that users are rarely linked to other users with the same value for the school attribute. Since in 2005 users on Facebook were mostly university students, it seems reasonable that users may often not come from the same high school as their Facebook friends attending the same university. A similar phenomenon occurs with the region attribute of the Pokec dataset, which indicates the region where a given user lives.

Best results for each numeric attribute. Previous subsections presented accuracy results for nominal attributes.
In this subsection, an experiment is performed to compare the accuracy of the PGPI algorithms and the PGPI+ algorithms for numeric attributes. The goal is to evaluate the improvement provided in PGPI+ for handling numeric attributes. Furthermore, we also compared the accuracy obtained with the NB, RNB, CNB and LP algorithms. Because these latter algorithms and PGPI treat numeric attributes as nominal attributes, they are used as a baseline. Note that we attempted to also include the algorithm of Kosinski [12] in our comparison, because it is designed for inferring numeric attributes. However, it relies on linear regression, and its training failed when using a partial social graph.

| attribute | PGPI+ | PGPI-N+ | PGPI-G+ | NB | RNB | CNB | LP |
|---|---|---|---|---|---|---|---|
| status | 92.0% | 92.6% | 93.0% | 88.0% | 80.2% | 88.0% | 83.0% |
| gender | 96.1% | 84.7% | 95.8% | 51.1% | 57.7% | 47.3% | 50.4% |
| major | 33.8% | 30.4% | 32.8% | 16.6% | 15.0% | 9.0% | 16.7% |
| minor | 76.0% | 76.4% | 76.6% | 74.4% | 74.2% | 74.0% | 56.3% |
| residence | 64.6% | 62.4% | 64.4% | 55.6% | 55.0% | 54.8% | 49.1% |
| school | 7.4% | 16.8% | 7.0% | 10.6% | 9.8% | 10.6% | 10.1% |
| facts accessed | 482 | 226 | 431 | 189 | 580 | 189 | 10K |

Table II-1 Best accuracy results for nominal attributes on Facebook

In this experiment, the accuracy is measured differently than in the experiments on nominal attributes, since the goal for numeric attributes is to predict attribute values that are as close as possible to the real values rather than to predict the exact value. Thus, the accuracy for numeric attributes is measured in terms of average error and standard deviation of the predicted values from the real values. Table II-3 shows the best results obtained by the PGPI and PGPI+ algorithms for each numeric attribute on the Facebook and Pokec datasets, and the number of facts accessed to obtain these results. It can first be observed that PGPI+ performs best overall. For example, the PGPI-N+ algorithm performs better than the PGPI-N algorithm both in terms of average error and standard deviation for almost all attributes. For some attributes, the improvement provided by PGPI+ over PGPI is greater than for others. For example, for the height attribute, the average error obtained by PGPI-N+ is 10.32 instead of 14.0 for PGPI-N, and the standard deviation is 12.55 for PGPI-N+ instead of 36.52 for PGPI-N, a considerable improvement. The reason why PGPI+ generally performs better than PGPI is that it has specific inference procedures for numeric attributes, while PGPI treats them as nominal attributes.

| attribute | PGPI+ | PGPI-N+ | PGPI-G+ | NB | RNB | CNB | LP |
|---|---|---|---|---|---|---|---|
| Gender | 95.60% | 61.40% | 95.77% | 52.80% | 53.80% | 53.60% | 49.20% |
| English | 76.35% | 63.79% | 76.00% | 69.74% | 69.74% | 69.74% | 65.40% |
| French | 87.46% | 84.48% | 87.42% | 86.91% | 85.60% | 86.87% | 67.15% |
| German | 62.39% | 54.31% | 62.85% | 47.83% | 48.12% | 47.83% | 50.00% |
| Italian | 94.87% | 94.25% | 94.85% | 94.65% | 95.38% | 95.41% | 85.75% |
| Spanish | 95.15% | 94.54% | 95.14% | 94.38% | 95.08% | 94.29% | 80.52% |
| Smoker | 65.21% | 62.34% | 65.42% | 63.43% | 63.43% | 63.12% | 60.19% |
| Drink | 71.65% | 63.36% | 71.47% | 70.41% | 70.41% | 70.41% | 49.16% |
| Marital status | 76.57% | 70.86% | 76.40% | 76.11% | 76.02% | 76.07% | 69.92% |
| Hip-hop | 86.51% | 82.20% | 86.47% | 86.01% | 85.83% | 85.93% | 61.82% |
| Rap | 69.33% | 63.78% | 69.52% | 69.08% | 69.35% | 69.35% | 45.78% |
| Rock | 77.93% | 73.80% | 78.09% | 76.33% | 74.93% | 74.93% | 53.69% |
| Disco | 58.40% | 52.50% | 58.56% | 50.07% | 53.18% | 53.46% | 47.28% |
| Metal | 86.19% | 83.52% | 86.15% | 84.75% | 84.61% | 84.61% | 63.79% |
| Region | 18.60% | 10.20% | 18.71% | 6.20% | 6.20% | 6.20% | 10.00% |
| facts accessed | 334 | 6 | 347 | 375 | 378 | 278 | 10k |

Table II-2 Best accuracy results for nominal attributes on Pokec

Assessment of certainty values. Another experiment was performed to assess the certainty values calculated by PGPI+, PGPI-N+ and PGPI-G+. Fig. II-2 and Fig. II-3 respectively show the best accuracy obtained for the numeric attributes year and age for each of those algorithms. Each line represents the accuracy of predictions having at least a given certainty value. It can be seen that a high certainty value generally means a low average error, a low standard deviation and a low coverage (percentage of predictions made), as expected. For nominal attributes, the accuracies of predictions made by PGPI-N+, PGPI and PGPI-G+ having a certainty no less than 0 and no less than 0.7 are respectively 59%, 59% and 61% on Pokec, and 91%, 91% and 93% on Facebook, which also shows that the proposed certainty assessment is a good indicator of accuracy.

Figure II-2 Best results w.r.t. certainty for the year attribute (Facebook)

Figure II-3 Best results w.r.t. certainty for the age attribute (Pokec)

Table II-3 Average error and standard deviation for numeric attributes, shown as average error (standard deviation)

| algorithm | year (Facebook) | facts accessed | age (Pokec) | weight (Pokec) | height (Pokec) | facts accessed |
|---|---|---|---|---|---|---|
| PGPI+ | 0.95 (0.85) | 241 | 2.94 (4.55) | 9.83 (10.32) | 7.70 (11.75) | 189 |
| PGPI-N+ | 0.68 (0.72) | 44 | 3.92 (4.56) | 14.60 (12.56) | 10.32 (12.55) | 6 |
| PGPI-G+ | 0.99 (0.89) | 252 | 2.89 (4.45) | 9.83 (10.37) | 7.71 (11.76) | 290 |
| PGPI | 0.46 (0.93) | 148 | 2.55 (4.80) | 11.67 (11.53) | 8.75 (12.43) | 251 |
| PGPI-N | 0.46 (0.87) | 103 | 4.35 (5.11) | 17.28 (15.61) | 14.0 (36.52) | 6 |
| PGPI-G | 0.39 (0.93) | 223 | 2.20 (4.78) | 10.75 (10.86) | 8.35 (12.45) | 344 |
| NB | 1.30 (0.81) | 1,743 | 5.78 (4.34) | 14.13 (12.40) | 13.71 (11.17) | 8 |
| RNB | 1.30 (0.81) | 1,746 | 3.47 (4.71) | 13.74 (13.91) | 15.65 (13.92) | 83 |
| CNB | 1.30 (0.81) | 1,746 | 3.33 (4.40) | 13.38 (14.34) | 15.30 (12.47) | 84 |
| LP | 1.12 (1.30) | 11,247 | 5.07 (5.38) | 19.29 (15.80) | 18.42 (59.35) | 20,000 |

Best results using the full social graph. We also compared the accuracy of the algorithms using the full social graph. The best accuracy obtained by each algorithm on the Facebook and Pokec datasets is shown in Table II-4. It can be observed that the proposed PGPI+ algorithms provide an accuracy that is considerably higher than the accuracy of the compared algorithms, even when using the full social graph.

6
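The certainty-based filtering behind Fig. II-2 and Fig. II-3 can be sketched as follows. This is an illustrative snippet, not the PGPI+ certainty computation itself: the certainty values are taken as given inputs, and the prediction triples are hypothetical.

```python
def accuracy_coverage(predictions, min_certainty):
    """predictions: list of (predicted, real, certainty) triples.
    Accuracy is computed only over predictions whose certainty is at
    least min_certainty; coverage is the fraction of predictions kept."""
    kept = [(p, r) for p, r, c in predictions if c >= min_certainty]
    if not kept:
        return 0.0, 0.0
    accuracy = sum(p == r for p, r in kept) / len(kept)
    coverage = len(kept) / len(predictions)
    return accuracy, coverage

# hypothetical gender predictions: (predicted, real, certainty)
preds = [("male", "male", 0.9), ("female", "male", 0.4),
         ("male", "male", 0.8), ("female", "female", 0.3)]
acc, cov = accuracy_coverage(preds, 0.7)  # acc = 1.0, cov = 0.5
```

Raising the threshold keeps only high-certainty predictions, which tends to raise accuracy while lowering coverage, the trade-off visible in the figures.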

[Figure II-2: plots of average error, standard deviation and coverage vs. the number of accessed facts for PGPI+, PGPI-N+ and PGPI-G+, with one curve per minimum certainty value (0.3, 0.5, 0.7, 0.9).]
Figure II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute

algorithm   Facebook   Pokec
PGPI+       96.6       73.9
PGPI-N+     96.4       62.8
PGPI-G+     84.9       73.8
PGPI        63.8       62.0
PGPI-N      63.8       62.1
PGPI-G      56.2       65.9
NB          48.67      57.48
RNB         50.11      56.37
CNB         50.11      56.40
LP          48.03      47.31

Table II-4 Best accuracy for nominal attributes using the full social graph

Conclusion

We proposed an improved algorithm named PGPI+ for user profiling in online social networks under the constraint of a partial social graph and using rich information. PGPI+ extends the PGPI algorithm with new optimizations to improve its prediction accuracy and coverage, to handle numeric attributes, and to assess the certainty of predictions. An experimental evaluation with more than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI+ predicts user profiles with considerably more accuracy, while accessing a smaller part of the social graph, than five state-of-the-art algorithms. Moreover, an interesting result is that profile attributes such as status (student/professor) and gender can be predicted with more than 95% accuracy using PGPI+.
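The gap between PGPI and PGPI+ on numeric attributes comes down to treating a value nominally (exact-match voting) versus numerically (aggregation). A toy contrast of the two strategies, assuming a simple friend-vote scheme rather than the actual PGPI/PGPI+ inference procedures:

```python
from collections import Counter

def predict_nominal(friend_values):
    """PGPI-style: treat the attribute as nominal and take the
    most frequent value among friends (majority vote)."""
    return Counter(friend_values).most_common(1)[0][0]

def predict_numeric(friend_values):
    """PGPI+-style idea: treat the attribute as numeric and
    aggregate, e.g. by averaging friends' values."""
    return sum(friend_values) / len(friend_values)

# hypothetical "age" values observed among a user's friends
ages = [24, 25, 25, 40]
nominal = predict_nominal(ages)   # 25 (the most frequent value)
numeric = predict_numeric(ages)   # 28.5 (the average)
```

With nominal treatment, a prediction off by one year counts as fully wrong, whereas numeric aggregation keeps predictions close to the real value on average, which is why the average error and standard deviation improve for PGPI+.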

[Figure II-3: plots of average error, standard deviation and coverage vs. the number of accessed facts for PGPI+, PGPI-N+ and PGPI-G+, with one curve per minimum certainty value (0.3, 0.5, 0.7, 0.9).]
Figure II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute

There are several opportunities to improve the proposed algorithms. For future work, we plan to design new optimizations for PGPI+, to extend the proposed algorithms with text analysis capabilities, and to consider other types of information for making predictions. We also wish to perform experiments with other social networks and to compare accuracy with more algorithms. Lastly, we intend to develop algorithms for community detection [18, 22, 26] that are optimized to use a small number of facts from social networks.


GENERAL CONCLUSION

In this work, we proposed two novel algorithms for inferring user profiles in social networks using a partial social graph. To our knowledge, these algorithms are the first to let the user decide how much information may be accessed from the social graph to make a prediction. The first algorithm, PGPI, infers user profiles in the absence of a complete social graph and requires no training. It consists of two procedures named PGPI-N (Partial Graph Profile Inference - Node) and PGPI-G (Partial Graph Profile Inference - Group), which infer user profiles based on (1) friendship links and (2) group memberships, "likes" and "views" (when this information is available), respectively. An experimental evaluation was presented comparing the performance of PGPI to four user profile inference algorithms from the literature (NB, CNB, RNB and LP), using a real dataset from the Facebook social network. In this experiment, three performance measures were used: accuracy, the number of nodes accessed in the social graph, and coverage. The results showed that PGPI achieves an accuracy up to 34% higher than that of the compared methods, while using less information to make predictions. It is also worth noting that PGPI can predict some attributes with an accuracy above 90%. Although PGPI provides very high accuracy and accesses fewer nodes of the social graph, it treats numeric attributes as nominal attributes and does not take the certainty of predictions into account. To address these limitations, we introduced an improved version of PGPI named PGPI+. It introduces new optimizations to increase the prediction accuracy and the coverage of PGPI.
Moreover, PGPI+ handles numeric attributes in an appropriate way and offers a mechanism for estimating the certainty of predictions. Thanks to these novelties, PGPI+ provides a considerable gain in prediction accuracy and coverage.

We carried out an experiment comparing the performance of PGPI+ with that of PGPI and of the same four algorithms from the literature, using user profiles from the Pokec and Facebook social networks. The results show that PGPI+ offers higher accuracy on all the datasets used, while accessing fewer nodes of the social graph. We also compared the best accuracy of PGPI/PGPI+ for numeric attributes in terms of the average error and standard deviation of the predicted values with respect to the real values. The results indicate that PGPI+ generally yields more accurate inference. Finally, experimental results also show that the certainty computed for predictions is a good indicator of prediction accuracy. This research opens the way to several directions for future work. Among others, we plan to compare PGPI+ with other user profile inference algorithms specialized in handling numeric attributes or in estimating the certainty of predictions. We also plan to improve the PGPI+ algorithm by proposing new optimizations for handling numeric attributes. Finally, experiments with data from other social networks are planned.


REFERENCES

[1] Blenn, N., Doerr, C., Shadravan, N., Van Mieghem, P.: How much do your friends know about you?: reconstructing private information from the friendship graph. In: Proc. of the Fifth Workshop on Social Network Systems, pp. 1–6. ACM (2012)
[2] Bonchi, F., Castillo, C., Gionis, A., Jaimes, A.: Social network analysis and mining for business applications. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–37 (2011)
[3] Chaabane, A., Acs, G., Kaafar, M. A.: You are what you like! Information leakage through users' interests. In: Proc. of the 19th Annual Network and Distributed System Security Symposium. The Internet Society (2012)
[4] Chaudhari, G., Avadhanula, V., Sarawagi, S.: A few good predictions: selective node labeling in a social network. In: Proc. of the 7th ACM International Conference on Web Search and Data Mining, pp. 353–362. ACM (2014)
Chen, P. T., Hsieh, H. P.: Personalized mobile advertising: Its key attributes, trends, and social impact. Technological Forecasting and Social Change, 79(3), 543–557 (2012)
[5] Dahimene, R., Constantin, C., Du Mouza, C.: RecLand: A recommender system for social networks. In: Proc. 23rd ACM International Conference on Information and Knowledge Management, pp. 2063–2065 (2014)
[6] Davis Jr., C. A. et al.: Inferring the location of Twitter messages based on user relationships. Transactions in GIS, 15(6), 735–751 (2011)
[7] Dong, Y., Yang, Y., Tang, J., Yang, Y., Chawla, N. V.: Inferring user demographics and social strategies in mobile social networks. In: Proc. of the 20th ACM International Conference on Knowledge Discovery and Data Mining, pp. 15–24. ACM (2014)
[8] Dougnon, Y. R., Fournier-Viger, P., Nkambou, R.: Inferring user profiles in social networks using a partial social graph. In: Proc. 28th Canadian Conference on Artificial Intelligence, Springer, LNAI 9091, pp. 84–99 (2015)
[9] He, J., Chu, W. W., Liu, Z. V.: Inferring privacy information from social networks. In: Proc. of the 2006 IEEE International Conference on Intelligence and Security Informatics, pp. 154–165. Springer, Heidelberg (2006)
[10] Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Preventing private information inference attacks on social networks. IEEE Transactions on Knowledge and Data Engineering, 25(8), 1849–1862 (2013)
[11] Jurgens, D.: That's what friends are for: Inferring location in online social media platforms based on social relationships. In: Proc. of the 7th International AAAI Conference on Weblogs and Social Media, pp. 273–282. AAAI Press (2013)
[12] Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5805 (2013)
[13] Li, R., Wang, C., Chang, K. C.-C.: User profiling in an ego network: co-profiling attributes and relationships. In: Proc. of the 23rd International Conference on World Wide Web, pp. 819–830. ACM (2014)
[14] Lindamood, J., Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Inferring private information using social network data. In: Proc. of the 18th International Conference on World Wide Web, pp. 1145–1146. ACM (2009)
[15] Compendium of global digital statistics. We Are Social Ltd. (2015)
[16] Mislove, A., Viswanath, B., Gummadi, K. P., Druschel, P.: You are who you know: inferring user profiles in online social networks. In: Proc. of the 3rd ACM International Conference on Web Search and Data Mining, pp. 251–260. ACM (2010)
[17] Mazhari, S., Fakhrahmad, S. M., Sadeghbeygi, H.: A user-profile-based friendship recommendation solution in social networks. Journal of Information Science, 41(3), 284–295 (2015)
[18] Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Mining and Knowledge Discovery, 24(3), 515–554 (2012)
[19] Petrovic, I., Perkovic, P., Stajduhar, I.: A profile- and community-driven book recommender system. In: Proc. 38th International Convention on Information and Communication Technology, Electronics and Microelectronics, pp. 631–635 (2015)
[20] Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J.: Our Twitter profiles, our selves: Predicting personality with Twitter. In: Proc. of the 3rd IEEE International Conference on Social Computing, pp. 180–185. IEEE Press (2011)
[21] Scott, J., Carrington, P. J.: The SAGE Handbook of Social Network Analysis. SAGE Publications (2011)
[22] Takaffoli, M., Sangi, F., Fagnan, J., Zaïane, O. R.: Community evolution mining in dynamic social networks. Procedia - Social and Behavioral Sciences, 22, 49–58 (2011)
[23] Traud, A. L., Mucha, P. J., Porter, M. A.: Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16), 4165–4180 (2012)
[24] Tiroshi, A., Kuflik, T., Kay, J., Kummerfeld, B.: Recommender systems and the social web. In: Proc. 19th International Conference on Advances in User Modeling, pp. 60–70 (2012)
[25] Tucker, C. E.: Social networks, personalized advertising, and privacy controls. Journal of Marketing Research, 51(5), 546–562 (2014)
[26] Wakita, K., Tsurumi, T.: Finding community structure in mega-scale social networks. In: Proc. 16th International Conference on World Wide Web, pp. 1275–1276 (2007)
[27] Zhou, X., Xu, Y., Li, Y., Josang, A., Cox, C.: The state-of-the-art in personalized recommender systems for social networking. Artificial Intelligence Review, 37(2), 119–132 (2012)
