igraph Database Project - Thibault François Franville-Lafargue

igraph Database Project

Final Report

Mathieu Anquetin Pierre Baudemont Thibault Franville-Lafargue Emmanuel Navarro Guillaume Racineux

Client : Bruno Gaume Tutor : Emmanuel Murzeau March 13, 2009

Abstract

This document is the final report of our third-year long-term project as ENSEEIHT students, carried out from January 27 to March 13, 2009. It describes our work on the igraph library, aimed at making it use a database instead of physical memory.

Contents

Introduction                                                      6
    Scientific context                                            6
    igraph, an existing library                                   6

1  Project management                                             8
   1.1  Team work                                                 8
        1.1.1  Team                                               8
        1.1.2  Communication                                      8
   1.2  Requirements                                              9
   1.3  Schedule                                                 10
        1.3.1  Week by week work                                 10
        1.3.2  Schedule evolution                                12
   1.4  Risk management                                          13
   1.5  Tools                                                    13
        1.5.1  Collaborative Environment                         13
        1.5.2  Additional software or libraries used             15

2  Project realization                                           16
   2.1  The layered architecture of igraph                       16
   2.2  Development                                              17
        2.2.1  Version 0                                         17
        2.2.2  Version 1                                         17
        2.2.3  Version 2                                         19
   2.3  Tests and benches                                        22
        2.3.1  Tests                                             22
        2.3.2  Benches                                           23
   2.4  Performance measure results                              24

Conclusion                                                       26
   2.5  Achievements                                             26
   2.6  Experience                                               26
   2.7  What next?                                               26

A  Appendix A: tutorials                                         27
   A.1  Installation tutorial on Linux                           27
   A.2  Installation tutorial on Windows                         27
        A.2.1  Cygwin or MinGW?                                  28
        A.2.2  Installation of MinGW                             28
        A.2.3  Installation of MSYS                              29
        A.2.4  libxml2                                           29
        A.2.5  Compiling igraphdb                                29
   A.3  Getting started                                          30
   A.4  Creating and using a DSN                                 31
        A.4.1  With Linux                                        31
        A.4.2  With Windows                                      31
   A.5  List of compatible databases                             31

B  Appendix B: API and internal implementation                   32
   B.1  Database architecture                                    32
   B.2  Memory data structure architecture                       32
        B.2.1  igraph_t structure                                33
        B.2.2  igraphdb_cache_t structure                        34
        B.2.3  hstmt_container_t structure                       34
   B.3  Database related functions API                           35
        B.3.1  Environment variables                             35
        B.3.2  Added functions API                               35
   B.4  Unit test description                                    35
        B.4.1  Graph Constructors and Destructors                36
        B.4.2  Basic Query Operations                            36
        B.4.3  Adding and Deleting Vertices and Edges            37

C  Appendix C: Requirements                                      39
   C.1  Functional requirements                                  39
        C.1.1  Functions of the basic interface                  39
        C.1.2  Constructors and destructors of the basic interface  39
        C.1.3  Query operations of the basic interface           40
        C.1.4  Modification operations of the basic interface    41
        C.1.5  Useful upper functions                            41
        C.1.6  Python wrapper functions                          41
        C.1.7  Other functions                                   41
   C.2  Environment requirements                                 42
        C.2.1  Operating system                                  42
        C.2.2  Database                                          42
   C.3  Documentation requirements                               42
        C.3.1  Tutorials                                         42
        C.3.2  Architecture and performance                      43
   C.4  Function lists: the basic interface                      44
   C.5  Function list: useful upper functions                    45
   C.6  Bench tests                                              45
        C.6.1  Example graphs                                    45
        C.6.2  Test algorithms                                   45

List of Figures

1    Project global view                                          7
1.1  The last version of the schedule                            10
1.2  Schedule evolution sample                                   12
2.1  The layered architecture of igraph                          16
2.2  igraphdb with cache architecture                            20
2.3  Direct mapping                                              20
2.4  Cache vertex/edge table cell content                        21
2.5  Vertex score list                                           21
2.6  Cache whole structure                                       22

List of Tables

1.1  Risk study results                                          13
2.1  An example of a graphs_index table                          17
2.2  An example of a graph_edges_X table                         18
2.3  Requirement and test correspondence matrix                  23
2.4  Memory usage measure results                                24
2.5  Version performance comparison                              24
2.6  Database performance comparison                             25
B.1  Structure of table graphs_index                             32
B.2  Structure of table graph_edges_X                            32
C.1  Graph Constructors and Destructors                          44
C.2  Basic Query Operations                                      44
C.3  Adding and Deleting Vertices and Edges                      44
C.4  Useful upper layer functions                                45

Introduction

Scientific context

Graphs are very simple mathematical objects: a set of vertices connected by a set of edges. Yet graphs are powerful modeling tools, used in numerous research fields and by many applications (storage, analysis, ...). One of the best examples of a useful graph is the Internet hyper-link graph: if each web page is a vertex, each hyper-link between pages is an edge between the two vertices representing those pages. This graph, which captures the hyper-link structure of the Internet, is useful for improving the performance of search engines. Our client, Bruno Gaume, pursues research on precisely such hyper-link graphs within the Quaero project [1]. Graphs used for this kind of work are extremely large. For instance, a graph extracted solely from the French version of the Wikipedia website counts about 400 thousand vertices and more than 13 million edges, and a part of the Exalead crawl provided to the Quaero project contains 2 million pages.

igraph, an existing library

igraph is an existing free software package for creating and manipulating undirected and directed graphs [?]. It includes implementations of classic graph theory problems such as minimum spanning trees and network flow, and also implements algorithms for some recent network analysis methods, such as community structure search. One limitation of igraph is that it can only work on graphs that fit into physical memory, so it cannot handle graphs such as those extracted from the Internet. Hence the goal of this project: rewriting part of the igraph library so that it uses a database instead of physical memory. This will allow working on much larger graphs than the current version does, as data is stored out of core [2].

[1] see http://www.quaero.org
[2] "out of core" is precisely how such software architectures are named in computer science; see http://en.wikipedia.org/wiki/Out-of-core_algorithm


Figure 1: Project global view


Chapter 1

Project management

1.1 Team work

The project was instigated by the client, Bruno Gaume, who was looking for an efficient tool to run research experiments on very large hyper-link graphs. The project was also supervised by Emmanuel Murzeau, currently working as a project manager at Airbus. Mr. Murzeau helped us manage the project by giving us advice and sharing his management experience.

1.1.1 Team

The project was carried out by five engineering students from ENSEEIHT. Here are their names and main roles:
• Emmanuel: project manager
• Thibault: prototype representative and core functions coder
• Mathieu: collaborative environment, integration and benches development representative
• Pierre: coding representative and core functions coder
• Guillaume: testing representative
These roles were not very strict, but the work was mainly dispatched according to them. Working with an efficient collaborative environment allowed us to make the best use of everyone's strengths by sharing tasks through the ticket system (cf. Collaborative Environment).

1.1.2 Communication

Our work was organized around weekly meetings, held with or without the tutor. We used these meetings to summarize the work done during the week and to compare it with the schedule. If the schedule was not respected, we took measures such as reassigning people to tasks, modifying the schedule, or dropping some tasks. The meetings were also used to prepare the work for the next week: the project manager distributed tasks to people according to their roles and abilities. At the end of each meeting, the project manager summarized it and sent e-mails to remind everyone of their tasks.

As we worked from home, our day-to-day communication was mainly through e-mail and instant messaging. We also used the ticket system provided by Trac. These tickets were very useful for sharing tasks and reporting bugs, and were an important part of our communication.

1.2 Requirements

The project was launched during the first meeting with the client and the supervisor (on the 22nd of January). From that point on, a requirement document was written to precisely define the need. The goal of this project was to rewrite the basic API of the igraph library so that it uses a database instead of physical memory. This way, bigger graphs can be used, as the limiting factor is no longer physical memory but disk space.

Fortunately, igraph was developed with the idea that every function of the upper layers should use the basic API. This basic API is a collection of about 20 low-level functions which form the direct interface with the graph structure (constructors, destructors, getters, setters). Any other function should normally go through this basic API to access the graph; direct access to the graph structure is supposed to be completely avoided, as only the basic API handles it. Therefore, the project should normally consist only of rewriting the basic API. Hence the first set of requirements:

Functions shall keep the same functionality as in the original igraph.
Functions shall keep the same signature as in the original igraph.

Indeed, the new basic API should remain compatible with the unchanged upper functions of igraph, so the basic functions have to keep the same functionality, signature, return type and error types. Furthermore, the new library should support at least one kind of database:

The library shall be able to use SQLite.
The library shall be adaptable to any database using ODBC.

By using ODBC, we ensure that the new version of igraph is compatible with any kind of database. Another requirement is to provide a data caching system in memory:

Functions shall query the database to get data.
Functions shall manage a cache and try to get data from it, before querying the database.
A cache is a classic computer science technique which consists in duplicating data in a fast-access memory (here, physical memory) when the original data is expensive to fetch (here, through a database query). Finally, igraph has to remain compatible with the same operating systems as before:


The library shall work on Linux (32 bits).
The library shall work on Linux (64 bits).
The library shall work on Windows XP (32 bits).
The library shall work on Windows Vista (32 bits).

The other requirements (per-function and documentation requirements), as well as the list of functions in the basic API, can be found in the requirement document, Appendix C.

1.3 Schedule

Figure 1.1 shows the last version of the schedule.

Figure 1.1: The last version of the schedule

1.3.1 Week by week work

The first week of the project (week 4) was used for:
• installing the collaborative environment (Trac, SVN, Wiki, Hudson)
• getting to grips with igraph
• getting to grips with SQLite and ODBC


• writing the requirement document (this document was updated until the final version was sent to the client for signing on the 16th of February)
• writing a speed comparison test between ODBC access and direct SQLite access. It turned out that ODBC access is as fast as direct access to the database, and even faster in some cases, because ODBC drivers are sometimes very well optimized.

During the second week (week 5), we worked on:
• the design of the database and of the constructors/destructors. We decided to create a unique index table storing the IDs of the graphs and their attributes, such as the number of vertices or the orientation. An additional edge table is created for each graph to store its edges (one edge in the table is represented by an edge ID, the source vertex ID and the target vertex ID).
• the start of the igraphdb prototype (the V0 version)
• getting to grips with the existing igraph tests (unit and integration tests)
• the creation of scripts for Hudson
• some documentation: compiling and installation tutorials, and database design

The third week (week 6) was used to:
• review the requirement document with the client, who clarified his vision of the cache. The requirements were thus finished by the end of the week and sent on Monday, the 16th of February.
• finish the first V0 prototype of igraphdb (ODBC access, but no database optimization or cache yet)
• create new unit tests for igraphdb
• start designing and documenting the cache

The fourth week (week 7) was used for:
• sending the requirement document to the client for signing
• debugging the V0 prototype
• creating speed benches for the prototype and running them
• optimizing the igraphdb prototype with the help of the speed benches (still no cache, but optimization of the database: the V1 version)

At the mid-project stage, we were cutting it close with the schedule, but still on time.
The development of the second version of the library (V1: ODBC access with optimization but no cache) was almost finished and ready to be tested. The fifth week (week 8) was used for:
• debugging V1
• implementing V2 (the version with cache)

During the sixth week (week 9), we worked on:
• continuing V2
• writing documentation
• preparing the final report and the presentation

Finally, for the seventh week (week 10), we planned only presentation training and report writing, as the presentation and the report were due on Friday. Simultaneously, we were trying to finish implementing V2.

1.3.2 Schedule evolution

Regarding the schedule, the whole group agrees that it is very useful to have a global time view of the project, both to keep working and to have a good idea of the project's progress. It is also really important to make weekly updates in order to spot delays and modify the schedule if needed. Figure 1.2 shows how our schedule changed from day 1 to the last week of the project.

Figure 1.2: Schedule evolution sample ((a) February 3; (b) February 4; (c) February 13; (d) March 9)


1.4 Risk management

Before starting, we had to identify the risks that could occur during the project, in order to prevent them from happening. Once identified, we evaluated them according to their criticality and their probability. Their global evaluation, the product of these two criteria, allowed us to sort them. Finally, we looked for an anticipation action and a solution (in case the risk occurred) for each one. Table 1.1 shows the result of this study.

Name                                      Crit.  Proba.  Eval.  Anticipation                   Solution
Server crash                              3/3    2/3     6/9    Backups                        Restore from backup
Team member unavailability                3/3    2/3     6/9    Schedule margins               Schedule revision, task redistribution
Unexpected difficulty                     2/3    3/3     6/9    Schedule margins               Schedule revision
ODBC lack of performance                  2/3    2/3     4/9    Comparison test bench          Work directly with the DBMS API
Unexpected modifications from the client  2/3    2/3     4/9    Schedule margins               Schedule revision
Multi-layer organisation violation        3/3    1/3     3/9    Original source code analysis  Schedule revision, rewrite the incriminated code

Table 1.1: Risk study results

1.5 Tools

1.5.1 Collaborative Environment

We made heavy use of our collaborative environment during this project. As we were generally working from home, we needed an efficient environment to work at the same time and share our ideas. We used:
• instant messaging (chat or VoIP) for discussion, help and resolution of the small problems each of us encountered daily
• Trac, an enhanced wiki and issue tracking system for software development projects. Trac takes a minimalistic approach to web-based software project management; its mission is to help developers write great software while staying out of the way. It is a perfect collaborative environment for open-source-oriented development. Trac provides:
  • an interface to Subversion
  • an integrated wiki and convenient reporting facilities

  • a ticket system
  • milestone management
  • a Hudson interface with automated builder and publisher

1.5.1.1 Subversion

Subversion (SVN) is a version control system. It is used to maintain current and historical versions of files such as source code, web pages, and documentation.

1.5.1.2 Wiki

A wiki is a page or collection of web pages designed to enable anyone who accesses it to contribute or modify content. We used ours mainly to write documentation (design and architecture explanations, tutorials) and as a journal keeping track of our work progression, so everyone could see and understand what the others had done.

1.5.1.3 Ticket system

The ticket system is a really convenient way of creating and distributing new tasks. When someone finds a problem or has an idea, he can create a new ticket with the task to do. Then, someone in the group who has finished his work can visit the ticket store and take a task by accepting its ticket, so that the task is assigned to him. The ticket system was used mainly to create and propose new tasks such as debugging a specific function, finding the source of an error, or creating a new specific bench or test.

1.5.1.4 Milestone management

A milestone is the end of a stage that marks the completion of a work phase. In Trac, scheduling and describing a milestone is made easy by the ticket system: after setting the due date of the milestone, the project manager links tickets to it. Closing the milestone then corresponds to fixing every ticket attached to it.

1.5.1.5 Hudson

Lastly, we also used Hudson, a continuous integration server. The main goal of Hudson is to ease the work of developers and increase productivity by automating some steps of the development process. To make things clearer, let us outline a typical example:
• a developer changes a part of the source code
• alone, he checks that the modification is correct by running tests on it
• after that, he commits his changes to the version control system
• next, a developer must compile the whole application found on the version control system to make sure that no conflict arose from all the individual changes
• he also has to run some tests to make sure there is no regression between the old version and the new one
• when everything is certified as correct, the application can be distributed


In this scenario, one person is responsible for the whole test and integration process. This means that this person cannot take part in development and must run the same scenario again and again. To avoid this problem, Hudson takes on this role and automates it. Briefly, every time a change is found on the version control system, Hudson builds the new version, runs the tests and distributes the application. Moreover, if a bug is found, the responsible people are warned by e-mail, RSS, instant messaging, etc. With this process automated, a developer is free to help his co-workers, and the human time previously spent on these repetitive tasks can be used for something else; that is why productivity increases.

1.5.2 Additional software or libraries used

1.5.2.1 GanttProject

For planning and scheduling, we used GanttProject [1], a free software package which makes it very easy to create and manage a schedule for this kind of project.

1.5.2.2 CUnit

To make sure our functions respected the igraph API, we needed unit tests. For this purpose we used the C library CUnit, an equivalent of Java's JUnit. It permits the creation of series of tests, in which you can set up an environment and assert that the results are those expected. The results can be displayed in different ways, including on the console or in an XML file.

[1] see http://ganttproject.biz/


Chapter 2

Project realization

2.1 The layered architecture of igraph

Figure 2.1: The layered architecture of igraph ((a) original igraph architecture; (b) igraphdb architecture)

As shown by Figure 2.1, igraph is designed in three layers:
• high level language wrappers (Python, R, Ruby): the upper layer
• graph operation layer: a C layer implementing many classic algorithms from graph theory
• data access layer: a C layer providing the basic functions to access graphs
As explained before, all our work is located in the data access layer.


2.2 Development

We developed three versions of our software following an incremental, iterative cycle. Each version is detailed below.

2.2.1 Version 0: “proof of concept”

The goal of this first version was to create a simple implementation of igraphdb as a proof of concept. Its characteristics were:
• to fulfill the specification requirements (mainly those about “non-regression”)
• to store graphs in a database
• to connect to the database with simple ODBC requests
• not to use any cache mechanism

The major point at this step was to establish a good architecture for storing graphs in a database. As we wanted a single database to be able to contain more than one graph, we used a multi-table architecture. A unique table, the main one, serves as an index table, giving access to the others, which actually store the graph data. This index table contains:
• graph IDs, identifying the graphs in the database
• the attributes of these graphs (is_directed: whether or not the graph is directed, and nb_vertices: the maximum number of vertices in the graph)
Table 2.1 is an example of such a table.

graphs_index
graph_id   is_directed   nb_vertices
0          0             1055
1          1             560
2          0             10500456

Table 2.1: An example of a graphs_index table

The other tables in the database represent the graphs. There is one table per graph, and it stores only the edges (identified by an edge ID), each being the association of a source vertex ID and a target vertex ID. Table 2.2 is an example of such a table.

2.2.2 Version 1: database optimization

Version 1 was meant to optimize the previous one by enhancing the database handling; no cache mechanism was used at this step yet. The enhancements consisted in:
• using the database index mechanism on the source and target columns of each graph table


• using pre-compiled database requests in every function working with vectors or loops
• ensuring that the property source < target holds for every edge of an undirected graph
• storing the is_directed, vcount and ecount attributes in the igraph_t C structure

graph_edges_0
edge_id   source   target
1         1        2
2         2        3
3         2        4
4         3        5
5         4        5

Table 2.2: An example of a graph_edges_X table

2.2.2.1 Database index mechanism

A database index is an internal data structure that improves request speed on a database table. Creating an index costs a little disk space and slightly slows down adding and deleting rows in the indexed table. However, as a graph is generally created once and for all, using indexes is highly profitable, since the complexity of a request goes from O(n) to O(log n). Hence, for each graph in the database, indexes are created on the source vertex IDs and the target vertex IDs, so there are two indexes and one edge table per graph.

2.2.2.2 Pre-compiled requests

Pre-compiled requests

A lot of functions are working either on loops or on vectors. In this case, they generally do the same kind of requests several times where only one parameter differs, for example : for (int i = 0; i < vector.size(); i++){ REQUEST req = new REQUEST("SELECT * FROM graph_edges_1 WHERE source = vector(2*i) AND target = vector(2*i + 1)"); req.Exec(); } Instead, a unique pre-compiled request can be created before the loop, and only the variable parameters will be changed afterwards :


PRECOMPILED_REQUEST req = new PRECOMPILED_REQUEST(
    "SELECT * FROM graph_edges_1 WHERE source = ? AND target = ?");
for (int i = 0; i < vector.size() / 2; i++) {
    req.BindParameter(1, vector(2*i));
    req.BindParameter(2, vector(2*i + 1));
    req.Exec();
}

The best thing to do is to create the pre-compiled request once and for all during the initialization of the graph structure.

2.2.2.3 source < target for undirected graphs

In the case of an undirected graph, the same edge can be represented by two couples (source, target): for example, (1,2) and (2,1) represent the same edge. By enforcing the constraint source < target, the number of lookups in the table of an undirected graph can be lowered for requests like:

REQUEST req = new REQUEST("SELECT * FROM graph_edges_1"
                          " WHERE (source = 1 AND target = 2)"
                          " OR (source = 2 AND target = 1)");

because the request now simply becomes:

REQUEST req = new REQUEST("SELECT * FROM graph_edges_1"
                          " WHERE source = 1 AND target = 2");

2.2.3 Version 2: data cached in physical memory

This last version aimed at introducing an in-memory cache mechanism to reduce hard disk accesses and thus improve performance. Figure 2.2 shows the location of the new cache layer in the igraphdb architecture. The cache stores vertices (i.e. their ID, called VID, plus degree, neighbors and adjacent edges) and edges (i.e. their ID, called EID, plus source VID and target VID). A substantial part of our work on this version was to define an efficient design for this caching layer. The architecture we proposed is detailed below.

2.2.3.1 Direct mapping

To ensure good performance, we chose to put the cached data (i.e. vertices and edges) into tables, so as to have direct access to cached elements. The question that arises from such an architecture is how to choose a table index for a given piece of data. Because of the performance constraint, we implemented direct mapping with the data ID (VID or EID).


Figure 2.2: igraphdb with cache architecture

Direct mapping is a simple mechanism which directly computes the index from an ID. This implies that a data item with a given ID will always have the same table index. As cache tables have limited sizes, this index is simply the ID modulo the size of the table, as illustrated in Figure 2.3: the first line represents vertices stored in the database; the second line is the cache vertex table, with a size of 4 cells. As shown by the colors, vertices with the same modulo lie in the same table cell.

Cache vertex table

cache table index = VID % table_size

Figure 2.3: Direct mapping

2.2.3.2 Cell content

Due to direct mapping, different vertices can have the same table index. Usually, when a vertex is to be added into a cell which already contains another one, the cell is cleaned. However, igraphdb is used on graphs: if you are working on a given vertex (hence it is in the cache), you are probably also working on its neighbors, and those neighbors may have the same table index. So, to improve performance, we have chosen to store more than one vertex per table cell, which avoids a likely intensive eviction thrashing when working on a vertex neighborhood. The remaining question was to find an efficient cell structure. We chose a binary search tree indexed by data ID, because it provides a lookup complexity of O(log n). This is illustrated by figure 2.4.

Figure 2.4: Cache vertex/edge table cell content (each vertex table cell is a binary search tree indexed by VID values, with a lookup complexity of O(log n))

As shown in figure 2.4, a stored vertex consists of:
• its VID
• a score value (explained below)
• its neighbors, stored in vectors and separated into IN and OUT according to edge direction
• its adjacent edges, with the same structure

2.2.3.3 Scoring

The last point to detail is the cache cleaning policy. When elements are to be added while the cache is full, some existing ones must be deleted; the problem is how to choose them. That is why scoring has been used: each element in the cache is given a score value, and the lowest-scored elements are deleted when the cache is full. We have put those values into a sorted doubly linked list, so the list is simply read from its end to determine the lowest-scored elements. This score list is illustrated by figure 2.5.

Figure 2.5: Vertex score list (cached vertices, each holding VID, score, 'in'/'out' neighbors and 'in'/'out' adjacent edges, chained in a doubly linked list sorted by score)

With this design it is possible to use different scoring policies. A classical one is LRU (Least Recently Used), which consists in deleting the least recently used elements first. To implement it, we simply move an element to the top of the list each time it is accessed. Other policies would have been possible: for instance, vertex degree or other properties of the graph elements could have been taken into account.

To make this architecture clearer, the whole structure is summed up in figure 2.6.

Figure 2.6: Cache whole structure (cache structure: table size (2^n, for fast offset computation), maximum total size, vertex table, vertex score list, edge table and edge score list; the edge side has the same structure as the vertex side)

Remark : table sizes have to be a power of 2. This permits fast index computation : the modulo is realized with a mere binary AND operation.

2.3 Tests and benches

2.3.1 Tests

We needed to assert that the requirements were respected. For this we created different kinds of test. Table 2.3 summarizes which test corresponds to which requirement. The requirements are described in appendix C on page 39.

Unit tests One of the main features of our test plan was unit tests for the modified functions. For these we have used the CUnit library. These tests assert that the function return codes are those described in the API, and that the functions work properly with valid arguments. We also tried to test the functions with uncommon arguments.

Integration tests A set of examples is given with the igraph library; they are used to check that the installation is correct. As they use a lot of functions from different layers of the library, we took them as an integration test. The other part of our integration tests was the use of the Python layer of the library. This test was not automated, but it asserts that there is no major problem in our rewriting.

Manual tests Of course, our test plan also included the usual development tests, such as checking that the database is modified correctly by looking directly into it.

Requirement code

Unit test

1.1.1 1.1.2 1.2.1 1.2.2 1.2.3 1.2.4 1.2.5 1.2.6 1.3.1 1.3.2 1.3.3 1.3.4 1.3.5 1.3.6 1.3.7 1.4.1 1.4.2 1.4.3 1.5.1 1.6.1 1.7.1 2.x.x

X X

Bench test

Database check

Integration Initial test code check X X

Cross compilation

Architecture validation

X X X X

X X

X

X

X X X X X X X X X X X

X X

X X X X X X X

Table 2.3: Requirement and test correspondence matrix

Code check Our work on the existing igraph library was made under some hypotheses. The main one was that the igraph layered architecture was respected: if the upper layer functions did not use the interface properly, the whole project would have been at stake. So we had to check the code of the entire library to be sure that all the functions call the basic data access layer and do not directly access and/or use the graph structure.

Architecture validation Some requirements were not testable; they simply correspond to architectural choices that we had to make.

2.3.2 Benches

Goal The performance of our implementation was very important. Indeed, as igraphdb now uses the hard disk to store data, we were expecting to experience a loss of speed while, on the contrary, decreasing the memory usage. So, in order to quantify these changes, we developed a set of test benches.

Requirements

Before developing these test benches, we thought of some requirements :

• These benches shall be easy to run
• The results given back by the application shall be easy to interpret
• These benches shall run on the two versions, igraph and igraphdb

• The results shall give a summary of used CPU time and memory

Tests The functions used by the test benches were chosen among the most used ones. For more information about these functions, see section C.6.2.

Implementation The internal implementation of the test benches is simple. It just consists of two concurrent threads:
1. Thread 1 runs the function and computes the CPU time used at the end
2. Thread 2 updates the memory usage by polling data exported by the Linux kernel about the process

Limitations As explained before, the test benches use data exported by the Linux kernel; therefore, they cannot be run on Microsoft Windows. Moreover, these data concern the whole process and not only the thread running the function. So, even though we tried to be as accurate as possible by measuring the size of the process before the tested function is run, the results may be approximate. Lastly, we also did not:
• measure the number of SQL requests done (see requirement C.3.2.5): as the cache was not entirely implemented, this measure was not interesting
• compute "Max Clique computation" (see bench 4), because its computation time was definitely too long
• use graph 5 (Web_big), as it was not provided in time by the client

2.4 Performance measure results

              original   v1
Dicosyn_V     ≈ 6 MB     ≈ 2 MB
wikipedia_fr  ≈ 736 MB   ≈ 7 MB
web_small     ≈ 221 MB   ≈ 2 MB

Maximum memory usage when querying neighbors for all vertices, with MySQL on computer 2

Table 2.4: Memory usage measure results

                             original   v0       v1        v2
neighbors for all vertices   ≈ 0.01s    ≈ 674s   ≈ 3.9s    N/A
"prox" for all vertices      ≈ 78.2s    N/A      ≈ 1448s   N/A
clustering coefficient*      ≈ 0.05s    N/A      ≈ 9.22s   N/A
clustering coefficient       ≈ 0.66s    N/A      ≈ 60.73s  N/A
average max length*          ≈ 44.71s   ≈ 1442s  ≈ 54s     N/A

CPU time on Dicosyn-V with SQLite and computer 1

Table 2.5: Version performance comparison

Remark Functions marked with a * create a memory structure called adjlist¹, which contains the neighborhood vector of every vertex. It permits extracting those vectors from the graph structure only once,

¹ structure provided by igraph, see http://cneurocvs.rmki.kfki.hu/igraph/doc/html/igraph-Adjlists.html


                                          original   SQLite   mySQL    PostgreSQL
clustering coefficient for Dicosyn        ≈ 0s       ≈ 28s    ≈ 10s    ≈ 11s
clustering coefficient for Wikipedia-fr   ≈ 161s     ≈ s      ≈ 3533s  ≈ 3533s

CPU time with computer 2, version 1 of igraphdb used

Table 2.6: Database performance comparison

instead of doing it each time the neighborhood is needed. The size of this structure is, in the worst case (for an undirected graph), 2 × |E| × 8 bytes².

Table 2.4 shows maximum memory usage. As expected, the database version needs far less memory, since data is stored in the database. This experiment also allowed us to check that the original igraph memory structure needs about (2 × |V| + 4 × |E|) × 8 bytes³ to store a graph, and that an additional 2 × |E| × 8 bytes are needed for the starred functions because of the adjlist structure. With igraphdb this structure is also created in physical memory, so 2 × |E| × 8 bytes are needed. That is still significantly less than the (2 × |V| + 6 × |E|) × 8 bytes needed by the original igraph, and it keeps the loss of performance limited: indeed this loss depends only on the graph size, not on the algorithm complexity. However, with even larger graphs, even 2 × |E| × 8 bytes will not fit into physical memory; therefore either the starred algorithms should be rewritten, or a new version of adjlist should be implemented. This new version could be designed to work with the cache layer and provide vectors from it.

Table 2.5 gives a comparison of the different versions of igraphdb. As expected, the indexes created by version 1 are essential: indeed they change the complexity from O(n) to O(log(n)).

Table 2.6 gives a comparison of database management systems. Using ODBC, it is easy to switch from one database system to another. We have seen that SQLite is efficient for small graphs, but for larger ones MySQL or PostgreSQL appear significantly faster.

Conclusion To sum up, the database version is 20 to 100 times slower than the in-memory one. This was expected, as it is approximately the ratio between physical memory access speed and hard disk access speed. The issue of memory size limit has now been turned into an issue of computation time. Our version will be particularly useful for extremely large graphs, when experiments work on local parts of the graph: only the locally considered part of the graph will be loaded into the cache instead of the entire graph.

² |E| is the number of edges of the graph
³ |V| is the number of vertices of the graph


Conclusion

2.5 What did we achieve?

We wanted to develop three versions of our software. The two that do not use any caching system are working fine. The version with the cache, however, is not completely achieved, because its design phase took time. Still, the cache architecture and its major functions are already coded, so it is now easy to add the remaining functions. We have also developed unit and bench tests, which are reusable for other projects.

2.6 What have we learned?

This project has been our first true experience of project management. We had to carry out each main project management step, from specification to testing, and we worked full-time on them. We have also learned to use many tools, like GanttProject or Trac. Moreover, as we modified a "real" piece of software, we had to deal with existing code written by other people: we needed a handover phase before starting to modify igraph, and we had to adapt to the conventions and coding style used in this library. Finally, this project has allowed us to improve our programming skills in the C language, which remains even today one of the major programming languages.

2.7 What next?

The original igraph library is licensed under the GPL. That is why we would like to support it and give our modifications back to the community: we will add the GPL-related notices to our code and contact the igraph development team as soon as possible. We have also thought about several improvements: managing attributes (like vertex weights) in the database, optimizing the cache design to take small-world specificities into account, optimizing some upper layer functions for the database version, and managing concurrent access to the database.


Appendix A

Appendix A : tutorials

A.1 Installation tutorial on Linux

First uncompress the igraphdb tar file into a temporary directory:

    tar xzf igraphdb.tar.gz
    cd igraphdb

Then, to install the complete C library, type:

    ./configure
    make
    make install

You will certainly need root privileges for the last command. You can also try:

    ./configure --help

to see the installation options, and read the INSTALL file. If you want to install the library into a specific folder, use for instance:

    ./configure --prefix=/home/<login>/igraphdb_install
    make
    make install

A.2

Installation tutorial on Windows

In order to install igraph on Windows, you can use:
• Cygwin
• or the combination of MinGW and MSYS


A.2.1 Cygwin or MinGW?

A.2.1.1 Cygwin

Cygwin is a Unix-like environment for Microsoft Windows. It consists of two parts:
• a DLL (cygwin1.dll) which acts as the Linux API emulation layer
• a collection of tools which provide a Linux look and feel

The library implements the POSIX system call API in terms of Win32 system calls. The collection of tools contains a GNU development toolchain (GCC, GDB, ld, ...) and many ports of Unix programs. The main drawback of Cygwin is that it sacrifices performance for compatibility; therefore, the DLL must be shipped with every project compiled with Cygwin for it to work.

A.2.1.2 MinGW and MSYS

MinGW (Minimalist GNU for Windows) started as a fork of Cygwin, but the main difference resides in its approach: MinGW aims to provide native functionality via direct Windows API calls, prioritizing performance. This approach makes it possible to build native Windows applications and/or libraries, which do not need a compatibility library to run. Its main drawback is that some Unix programs may not work, or even compile, with MinGW. MSYS (Minimal SYStem) provides a lightweight Unix-like shell containing sufficient tools to run autotools scripts.

A.2.1.3 Conclusion

As igraphdb does not rely on intricate POSIX calls, MinGW is a great choice thanks to its ability to produce native, and therefore efficient, Windows libraries. Because performance is at the core of this library, we want to keep it that way on every platform it runs on.

A.2.2 Installation of MinGW

See the official site for installation instructions: http://www.mingw.org/wiki/Getting_Started. Follow the manual installation and make sure you download at least these programs:
• binutils
• gcc-core
• mingw-runtime
• w32-api
• gcc-g++
• mingw-gdb
• mingw32-make
• mingw-utils


A.2.3 Installation of MSYS

See the official site for installation instructions: http://www.mingw.org/wiki/MSYS. Besides installing the recommended applications and updates, you need to install these supplementary tools:
• flex
• crypt
• perl
• regex

You can find them on the MSYS SourceForge page¹. Warning: make sure you use MSYS to decompress the archives.

A.2.4 libxml2

To compile igraphdb, you will need to build libxml2 from source². To do that, copy the archive into "C:\msys\1.0\home\name_of_user", launch MSYS and then type:

> cd /home/name_of_user
> tar -xvzf libxml2-2.6.30.tar.gz
> cd libxml2-2.6.30
> ./configure --prefix=/mingw
> make
> make install

A.2.5 Compiling igraphdb

To compile igraphdb on Windows, launch MSYS and then follow the same instructions as for Linux, but instead of typing

> make

type

> make CFLAGS="${CFLAGS} -DWin32"

This line is used to avoid a compilation error about missing headers that are not needed on Windows anyway.

¹ http://sourceforge.net/project/showfiles.php?group_id=2435&package_id=67879
² ftp://ftp.gnome.org/pub/GNOME/sources/libxml2/2.6/libxml2-2.6.30.tar.gz


A.3 Getting started

Now, how to use the library to run your own programs? You will have to use compilation flags and environment variables. During a compilation, the compiler and the linker need:
• the location of the library include files, i.e. the .h files
• the location of the library itself

This is given by compilation flags:
• -I (include files location)
• -L (library location)
• -l (library name)

Here is an example of a makefile for compiling a test program:

    CC=gcc
    IGRAPHLIB=/home/<login>/igraphdb_install
    CFLAGS=-W -Wall -ansi -pedantic -I$(IGRAPHLIB)/include/igraphdb
    LDFLAGS=-L$(IGRAPHLIB)/lib -ligraphdb
    EXEC=igraphdb_1_test igraphdb_2_test

    all: $(EXEC)

    %_test: %_test.o
            @echo "exe generation"
            $(CC) -o $@ $^ $(LDFLAGS)

    %.o: %.c
            @echo ".o generation"
            $(CC) -o $@ -c $< $(CFLAGS)

    clean:
            rm -rf *.o igraph_test

    mrproper: clean
            rm -rf $(EXEC)

Now you should be able to launch your program. However, if you have chosen a specific folder for the installation of the library, you will get an error like this:

    error while loading shared libraries: libigraph.so.0: cannot open shared object file: No such file or directory

This error is due to the fact that we are using a shared library whose installation location is not known by the linker, so you have to specify it yourself. To do so, you can use the LD_LIBRARY_PATH environment variable:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/<login>/igraphdb_install/lib/


A.4 Creating and using a DSN

A.4.1 With Linux

If you do not have it yet, install the unixodbc package, then launch ODBCConfig in a terminal with the command:

    ODBCConfig

You can choose to create a new DSN (Data Source Name) for the whole system or only for a specific user. Just click on the Add button, select one of the ODBC drivers present on your system in the list, and click on OK. Then choose a name for your DSN and the file where the database will save its data. If your distribution does not provide ODBCConfig, you will have to modify those files manually:
• /etc/odbc.ini for a system DSN
• ~/.odbc.ini for a user DSN
• /etc/odbcinst.ini

A.4.2 With Windows

Windows also provides an interface for managing all your DSN and it is as easy as ODBCConfig on Linux. You will find this interface in the Control Panel -> Administrative Tools -> Data Sources(ODBC).

A.5 List of compatible databases

As explained before, igraphdb is an implementation of igraph using ODBC. Hence, every database should be compatible if an ODBC driver is provided. Here is the list of databases on which we have already tested the compatibility : • SQLite • MySQL • PostgreSQL


Appendix B

Appendix B : API and internal implementation

B.1 Database architecture

Field        Type     Null  Default
graph_id     int(11)  Yes   NULL
directed     int(11)  Yes   NULL
nb_vertices  int(11)  Yes   NULL

Table B.1: Structure of table graphs_index

Field    Type     Null  Default
edge_id  int(11)  Yes   NULL
source   int(11)  Yes   NULL
target   int(11)  Yes   NULL

Table B.2: Structure of table graph_edges_X

The table and index names are defined in igraph.h. By default, they are:

    #define IGRAPHDB_GRAPH_INDEX_TABLE  "graphs_index"
    #define IGRAPHDB_GRAPH_TABLE_PREFIX "graph_edges_"
    #define IGRAPHDB_GRAPH_INDEX_SOURCE "index_source_"
    #define IGRAPHDB_GRAPH_INDEX_TARGET "index_target_"

B.2 Memory data structure architecture

The following structures have been used to hold information about graphs in memory. The igraph_t structure already existed in the original igraph source code: it was the main structure, which stored graph data like the vertex or edge vectors. We have deeply modified it, as data are now stored in a database. The other structures were added for specific purposes, like storing pre-compiled request handlers or the cache structure (in version 2).


B.2.1 igraph_t structure

B.2.1.1 Version 0

typedef struct igraph_s {
    /* id of the graph in the table graphs_index of the DB */
    long int graph_id;
    /* DSN of the database containing graphs */
    char *igraphdb_dsn;
    /* the necessary handlers: environment, connection, statement (request) */
    SQLHENV env;
    SQLHDBC con;
    /* attributes */
    void *attr;
} igraph_t;

B.2.1.2 Version 1

typedef struct igraph_s {
    /* id of the graph in the table graphs_index of the DB */
    long int graph_id;
    igraph_bool_t is_directed; /* whether the graph is directed or not */
    long int ecount;           /* number of edges */
    long int vcount;           /* number of vertices */
    /* DSN of the database containing graphs */
    char *igraphdb_dsn;
    /* the necessary handlers: environment, connection, statement (request) */
    SQLHENV env;
    SQLHDBC con;
    /* handler statement container, to store precompiled requests */
    hstmt_container_t *hstmt_cont;
    /* attributes */
    void *attr;
} igraph_t;

B.2.1.3 Version 2

typedef struct igraph_s {
    /* id of the graph in the table graphs_index of the DB */
    long int graph_id;
    igraph_bool_t is_directed; /* whether the graph is directed or not */
    long int ecount;           /* number of edges */
    long int vcount;           /* number of vertices */
    /* DSN of the database containing graphs */
    char *igraphdb_dsn;
    /* the necessary handlers: environment, connection, statement (request) */
    SQLHENV env;
    SQLHDBC con;
    /* handler statement container, to store precompiled requests */
    hstmt_container_t *hstmt_cont;
    /* cache */
    igraphdb_cache_t cache;
    /* attributes */
    void *attr;
} igraph_t;

B.2.2 igraphdb_cache_t structure

This structure represents the cache and is only used in v2.

typedef struct igraphdb_cache_s {
    /* XXX temporary solution: should be only the necessary info for the DB connection */
    igraph_t *graph;
    /* number of entries in the vertices table, i.e. number of BST roots;
     * this MUST be a power of 2 */
    size_t table_size;
    /* internal value to quickly compute the table offset given a VID */
    size_t table_offset_mask;
    /* maximum allowed size for the cache, in bytes */
    size_t max_size;
    /* current total size */
    size_t current_total_size;
    /* current size used by the stored vertices, in bytes */
    size_t current_vc_size;
    /* hash function to use on IDs, to "randomize" them so as to balance the BSTs */
    cache_hasher_f hash_fun;
    /* vertex BSTs index table */
    bstree_node_t **cv_roots;
    /* score doubly linked list */
    list_t *cv_score_list;
} igraphdb_cache_t;

B.2.3 hstmt_container_t structure

This structure was created for the optimization of v1 but is not yet complete. Its role is to store any pre-compiled request that igraphdb can use; this way, every pre-compiled request can be created once and for all during initialization. Only the pre-compiled requests of the igraph_neighbors function are currently stored.

typedef struct hstmt_container_s {
    SQLHSTMT stmt_get_neighbors_directed_in;
    SQLHSTMT stmt_get_neighbors_directed_out;
    SQLHSTMT stmt_get_neighbors_directed_in_count;
    SQLHSTMT stmt_get_neighbors_directed_out_count;
    long int stmt_param_vertex_id;
} hstmt_container_t;


B.3 Database related functions API

As required, we strictly respect the igraph basic interface. However, using a database implies new parameters, like the ODBC DSN, and new functions, like opening a connection to an existing database.

B.3.1 Environment variables

In order to specify the ODBC DSN to use when creating a graph (with igraph_empty), one environment variable can be used. Its name is IGRAPHDB_DSN, and the default value is igraphdb_dsn. These constants are defined in igraph.h:

    #define IGRAPHDB_DSN_ENVVAR_NAME "IGRAPHDB_DSN"
    #define IGRAPHDB_DEFAULT_DSN     "igraphdb_dsn"

B.3.2 Added functions API

B.3.2.1 int igraph_open_connection(igraph_t *graph, char *dsn, long int graph_id)

Creates an igraph_t structure connected to the graph with the given id in the database pointed to by the given DSN.

Parameters:
    graph     a pointer to an igraph_t structure
    dsn       the DSN name
    graph_id  the id of the graph in the database

Returns: error code: IGRAPH_DB_EALLOCATION, IGRAPH_DB_EREQUEST, IGRAPH_DB_EFETCH

B.3.2.2 int igraph_destroy_in_db(igraph_t *graph)

Destroys the graph in the DB and its reference in the table graphs_index. This function should not be used directly unless you want to manually clean the DB.

Parameters:
    graph  a pointer to an igraph_t structure

Returns: error code: IGRAPH_DB_EALLOCATION, IGRAPH_DB_EREQUEST

B.4 Unit test description

This section describes which features of every function are tested by the unit tests.

B.4.1

Graph Constructors and Destructors

B.4.1.1

igraph_empty

• test directed and undirected graphs • test the creation of the object • test for 0 edge graphs and for common (10) number of edges graphs • test the return codes for valid and non valid edges number B.4.1.2

igraph_empty_attrs

• test directed and undirected graphs • test the creation of the object • test for 0 edge graphs and for common (10) number of edges graphs • test the return codes for non valid edges number B.4.1.3

igraph_copy

• test the return codes • test the object creation • test the number of edges and vertices • test the direction B.4.1.4

igraph_destroy

• test the return code B.4.1.5

igraph_open_connection

• not tested yet B.4.1.6

igraph_destroy_in_db

• not tested yet

B.4.2

Basic Query Operations

B.4.2.1

igraph_vcount

• test the normal use of the function B.4.2.2

igraph_ecount

• test the normal use of the function

B.4.2.3

igraph_edge

• test the return code B.4.2.4

igraph_get_eid

• test the return code • test the creation of the objects B.4.2.5

igraph_neighbors

• test directed and undirected graphs • test of the 3 modes (in, out, all) • test that the results are sorted • test the return codes for invalid vertex ids and invalid modes • do not test the "not enough memory" return code B.4.2.6

igraph_adjacent

• test directed and undirected graphs • test of the 3 modes (in, out, all) • test the return codes for invalid vertex id and invalid mode B.4.2.7

igraph_is_directed

• test directed and undirected graph B.4.2.8

igraph_degree

• test directed and undirected graphs • test all combinations of the 3 modes (in, out, all) with loops • test the return codes for invalid vertex ids and invalid modes

B.4.3

Adding and Deleting Vertices and Edges

B.4.3.1

igraph_add_edge

• test directed and undirected graphs • test empty and non empty graphs • test that the edge is well directed • do not test the return codes


B.4.3.2

igraph_add_edges

• test directed graphs • test the number of edges • test that the edges are well directed • test the return codes for odd edge vector length, negative vertex ids and too high vertex ids B.4.3.3

igraph_add_vertices

• test the new number of vertices after a successful call • test the return codes with a negative parameter B.4.3.4

igraph_delete_edges

• test the number of edges • test that the edges are well deleted (with the igraph_neighbors function) • test the return codes for non existing edges B.4.3.5

igraph_delete_vertices

• test the number of vertices • test the return codes for non existing vertices • do not test the interaction with edges


Appendix C

Appendix C : Requirements

Notations

Priority : 1   Mandatory requirement
Priority : 2   Important requirement
Priority : 3   Less important requirement

C.1

Functional requirements

C.1.1

Functions of the basic interface

Requirements C.1.1.1 to C.1.1.2 apply to every function defined in tables C.1, C.2 and C.3. Requirement C.1.1.1 Functions shall keep the same functionality as in original igraph. Priority : 1 Requirement C.1.1.2 Functions shall keep the same signature as in original igraph. Priority : 1

C.1.2

Constructors and destructors of the basic interface

Requirements C.1.2.1 to C.1.2.4 apply to the functions defined in table C.1. Requirement C.1.2.1 User should be able to set which database is used. Rationale : This argument does not exist in original igraph, and signatures must not change (see requirement C.1.1.2) Priority : 1 Requirement C.1.2.2 User should be able to set the maximum memory size used by the cache before creating a graph. Priority : 2


Requirement C.1.2.3 Destructors must clean the memory and close the database connection. Rationale : To destroy the graph in the database, see requirement C.1.2.6. Priority : 1 Requirement C.1.2.4 Destructors shall not destroy data in the database. Priority : 1 Requirement C.1.2.5 It shall be possible to easily access an existing graph stored in a database. Rationale : A constructor which opens an existing graph by creating a connection to the database. Priority : 1 Requirement C.1.2.6 It shall be possible to destroy a graph object in the database. Rationale : Certainly with a destructor which, besides closing the connection to the database, deletes the graph in the database. Priority : 1

C.1.3

Query operations of the basic interface

Requirements C.1.3.1 to C.1.3.7 apply to every function defined in table C.2. Requirement C.1.3.1 Functions shall query the database to get data. Priority : 1 Requirement C.1.3.2 Functions shall manage a cache and try to get data from it, before querying the database. Priority : 2 Requirement C.1.3.3 The cache policy shall be defined in a scoring function. Rationale : A score is computed for each data in the cache, smaller scores are eliminated first. Priority : 2 Requirement C.1.3.4 The scoring function shall be inversely proportional to the time since the last access. Rationale : “older” data are removed first from the cache. Priority : 2 Requirement C.1.3.5 The scoring functions shall be optimized to take small-world properties into account. Rationale : Scoring function can depend on the time since the last access, the total number of access and properties of data itself. Priority : 3 Requirement C.1.3.6 Functions shall be at most 10 times slower than original igraph’s functions. Performance is given by bench test 1 on graph 1 (see section C.6). Priority : 2 Requirement C.1.3.7 Functions shall be at most 3 times slower than original igraph’s functions. Performance is given by bench test 5 on graph 1 (see section C.6). Rationale : To reach this performance


caching will certainly be needed. Priority : 3

C.1.4

Modification operations of the basic interface

Requirements C.1.4.1 to C.1.4.3 apply to every function defined in table C.3. Requirement C.1.4.1 Functions shall use a database to store the graph. Rationale : When there is no cache, modification operations just have to update data in the database. Priority : 1 Requirement C.1.4.2 Functions shall use a database to store the graph, and erase the entire cache upon modification. Rationale : The cache content can become inconsistent if there is a modification of the graph in the database. The solution required here is to invalidate the entire cache at each modification. Priority : 2 Requirement C.1.4.3 Functions shall use a database to store the graph, and erase from the cache only the data invalidated by the modification. Rationale : Same problem as in requirement C.1.4.2, but with a more optimized solution. Priority : 3

C.1.5

Useful upper functions

Requirement C.1.5.1 applies to every function defined in table C.4. Requirement C.1.5.1 Functions shall use the basic interface to access the graph. Priority : 1

C.1.6

Python wrapper functions

Requirement C.1.6.1 applies to every function used in python wrapper. Requirement C.1.6.1 Functions shall use the basic interface to access the graph. Rationale : The goal is to have the python interface working over igraphdb . Priority : 2

C.1.7

Other functions

Requirement C.1.7.1 Functions shall use the basic interface to access the graph. Priority : 3


C.2

Environment requirements

C.2.1

Operating system

Requirement C.2.1.1 The library shall work on Linux (32 bits). Priority : 1

Requirement C.2.1.2 The library shall work on Linux (64 bits). Priority : 1

Requirement C.2.1.3 The library shall work on Windows XP (32 bits). Priority : 2

Requirement C.2.1.4 The library shall work on Windows Vista (32 bits). Priority : 3

C.2.2

Database

Requirement C.2.2.1 The library shall be able to use SQLite. Priority : 1

Requirement C.2.2.2 The library shall be adaptable to any database using ODBC. Priority : 3
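To make requirement C.2.2.1 concrete, the sketch below stores a tiny graph in SQLite (via Python's standard `sqlite3` module, which is enough for a sketch) and answers two of the basic query operations of table C.2 with plain SQL. The one-row-per-edge schema and the helper names are our assumptions; the actual igraphdb schema is specified in the architecture documentation.

```python
import sqlite3

# Illustrative schema: one row per directed edge.  An index on the tail
# column speeds up neighbor queries, the dominant operation in bench test 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (eid INTEGER PRIMARY KEY, "
             "tail INTEGER, head INTEGER)")
conn.execute("CREATE INDEX edge_tail ON edge(tail)")
conn.executemany("INSERT INTO edge (tail, head) VALUES (?, ?)",
                 [(0, 1), (0, 2), (1, 2)])

def ecount(conn):
    """Counterpart of igraph_ecount: the number of edges in the graph."""
    return conn.execute("SELECT COUNT(*) FROM edge").fetchone()[0]

def neighbors_out(conn, v):
    """Counterpart of igraph_neighbors (outgoing edges only, for brevity)."""
    return [h for (h,) in conn.execute(
        "SELECT head FROM edge WHERE tail = ?", (v,))]
```

Requirement C.2.2.2 would replace the `sqlite3` connection with an ODBC one; the SQL itself stays portable.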

C.3

Documentation requirements

C.3.1

Tutorials

Requirement C.3.1.1 Detailed tutorials shall be provided to explain how to install igraph on Windows. Priority : 1

Requirement C.3.1.2 Detailed tutorials shall be provided to explain how to install igraph on Linux. Priority : 1

Requirement C.3.1.3 Detailed tutorials shall be provided to explain how to write a basic C program using igraph_db. Priority : 1

Requirement C.3.1.4 Detailed tutorials shall be provided to explain how to write a basic Python program using igraph_db. Priority : 1


C.3.2

Architecture and performance

Requirement C.3.2.1 A document shall detail the architecture of the database. Priority : 1

Requirement C.3.2.2 A document shall detail the architecture of the memory cache. Priority : 1

Requirement C.3.2.3 A document shall give the CPU time needed to compute all the bench tests defined in section C.6.2 on all the graphs defined in section C.6.1, when the computation is possible. Priority : 1

Requirement C.3.2.4 A document shall give the memory space needed to compute all the bench tests defined in section C.6.2 on all the graphs defined in section C.6.1, when the computation is possible. Priority : 1

Requirement C.3.2.5 A document shall give the number of SQL queries needed to compute all the bench tests defined in section C.6.2 on all the graphs defined in section C.6.1, when the computation is possible. Priority : 1

Remark : Concerning requirements C.3.2.3 to C.3.2.5, for some bench tests on some graphs the computation time needed will be too long for the test to complete.
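One way to obtain the SQL query count of requirement C.3.2.5, sketched below with SQLite, is a trace callback that fires once per executed statement; no change to the library code is needed. This is an assumption about the measurement method, not a description of how igraphdb counts queries. The connection uses `isolation_level=None` (autocommit) so that no implicit `BEGIN` statements inflate the count.

```python
import sqlite3

# Count every SQL statement executed on the connection.
conn = sqlite3.connect(":memory:", isolation_level=None)
counter = {"queries": 0}

def trace(statement):
    counter["queries"] += 1          # called once per executed statement

conn.set_trace_callback(trace)

conn.execute("CREATE TABLE edge (tail INTEGER, head INTEGER)")
conn.execute("INSERT INTO edge VALUES (0, 1)")
conn.execute("SELECT COUNT(*) FROM edge").fetchone()
# counter["queries"] now reflects the three statements above.
```

Running a bench test between a reset of the counter and a read of it yields the per-bench query count that the document of requirement C.3.2.5 has to report.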


C.4

Function lists : The basic interface

name                  functionality
igraph_empty          Creates an empty graph with some vertices and no edges.
igraph_empty_attrs    Creates an empty graph with some vertices, no edges and some graph attributes.
igraph_copy           Creates an exact (deep) copy of a graph.
igraph_destroy        Frees the memory allocated for a graph object.

Table C.1: Graph Constructors and Destructors

name                  functionality
igraph_vcount         The number of vertices in a graph.
igraph_ecount         The number of edges in a graph.
igraph_edge           Gives the head and tail vertices of an edge.
igraph_get_eid        Gets the edge id from the end points of an edge.
igraph_neighbors      Adjacent vertices to a vertex.
igraph_adjacent       Gives the adjacent edges of a vertex.
igraph_is_directed    Is this a directed graph?
igraph_degree         The degree of some vertices in a graph.

Table C.2: Basic Query Operations

name                    functionality
igraph_add_edge         Adds a single edge to a graph.
igraph_add_edges        Adds edges to a graph object.
igraph_add_vertices     Adds vertices to a graph.
igraph_delete_edges     Removes edges from a graph.
igraph_delete_vertices  Removes vertices (with all their edges) from the graph.

Table C.3: Adding and Deleting Vertices and Edges


C.5

Functions list : useful upper functions

name                                     functionality
igraph_shortest_paths                    The length of the shortest paths between vertices.
igraph_average_path_length               The average geodesic length in a graph.
igraph_transitivity_avglocal_undirected  The average local transitivity (clustering coefficient).
igraph_transitivity_undirected           The transitivity (clustering coefficient).
igraph_decompose                         Decomposes a graph into connected components.
igraph_maximal_cliques                   Finds all maximal cliques of a graph.
igraph_erdos_renyi_game                  Generates a random (Erdos-Renyi) graph.

Table C.4: Useful upper layer functions

C.6

Bench tests

C.6.1

Example graphs

These graphs are provided by the client.

Example graph 1 : (DSV) DicoSyn Verb, a synonym dictionary of French verbs.

Example graph 2 : (Rand_DSV) A random graph (Erdos-Renyi type) with the same number of vertices and edges as DSV. This graph is randomly created, but the same instance should be used every time; in other words, it is a specific, fixed graph.

Example graph 3 : (Wikipedia_fr) The graph of Wikipedia (fr) pages. A directed link exists from a page A to a page B if page A contains a link to page B.

Example graph 4 : (Web_small) A graph with about 2.5 million vertices, an extract of the hyperlink graph of the web.

Example graph 5 : (Web_big) A graph with about 100 million vertices, an extract of the hyperlink graph of the web.

C.6.2

Test algorithms

Bench test 1 : (All Neighbors Query) Ask for the neighbors of each node, using igraph_neighbors.

Bench test 2 : (C computation) Computation of the average local transitivity by igraph_transitivity_avglocal_undirected.

Bench test 3 : (L computation) Computation of the average geodesic length by igraph_average_path_length.

Bench test 4 : (Max Clique computation) Computation of all maximal cliques by igraph_maximal_cliques.

Bench test 5 : (Prox computation) Computation of the "proxemy vector" of each node of the graph. The proxemy vector is defined in [?, section 6].
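Bench test 1 can be sketched at toy scale as follows: build a small Erdos-Renyi random graph in the spirit of example graph 2 (fixed by a seed, so every run benches the same graph), then time the all-neighbors sweep. Sizes, seed and variable names are illustrative; the real benches run on graphs of up to about 100 million vertices through igraph_neighbors.

```python
import random
import time

# A small, reproducible Erdos-Renyi graph: the seed fixes the instance,
# mirroring the "same graph every time" constraint on Rand_DSV.
random.seed(42)
n, p = 1000, 0.01
adj = {v: [] for v in range(n)}
for u in range(n):
    for v in range(u + 1, n):
        if random.random() < p:
            adj[u].append(v)
            adj[v].append(u)

# Bench test 1: ask for the neighbors of each node and time the sweep.
start = time.perf_counter()
total_neighbors = sum(len(adj[v]) for v in range(n))
elapsed = time.perf_counter() - start
print(f"visited {total_neighbors} neighbor entries in {elapsed:.4f} s")
```

Requirements C.1.3.6 and C.1.3.7 compare exactly this kind of timing between the database-backed functions and the original in-memory igraph functions.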