Web communities
These datasets were extracted using a community-extraction algorithm. Each dataset is distributed as a tar.gz archive. The archive contains the HTML files of the sites found by the algorithm, an OPML file listing all the sites, a GDF file describing the graph structure of the community, and a comma-separated CSV file giving the same information in the form of an adjacency matrix. Four communities were extracted (in October 2009): Comics (fr), Scrapbooking (fr), Food (us) and Politics (us).
Some statistics on these datasets are summarized in the following table. Community size gives the number of nodes in the network, and alpha gives its edge density. The other figures relate to the extraction algorithm.
| | Comics (fr) | Scrapbooking (fr) | Food (us) | Politics (us) |
|---|---|---|---|---|
| Nb seed | 100 | 100 | 50 | 50 |
| Community size | 1 263 | 1 130 | 1 681 | 1 884 |
| Graph size | 41 435 | 20 611 | 55 061 | 105 197 |
| Fetched pages | 2 177 | 1 739 | 2 813 | 3 601 |
| Level max | 3 | 2 | 5 | 4 |
| alpha | 0.01821 | 0.01899 | 0.03560 | 0.02091 |
| beta | 0.00093 | 0.00147 | 0.00091 | 0.00065 |
| gamma | 0.03048 | 0.05579 | 0.03060 | 0.01808 |
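
As a rough illustration of how the adjacency-matrix CSV file could be used, the sketch below parses a matrix and computes an edge density. It rests on several assumptions that the dataset description does not confirm: that the CSV is a headerless square matrix of 0/1 entries, that the graph is directed, and that density means the number of edges divided by n(n - 1). The function names `read_adjacency` and `edge_density` are hypothetical, and the result is not guaranteed to match the alpha values in the table, whose exact definition is not given here.

```python
import csv
import io


def read_adjacency(csv_text):
    """Parse a headerless, comma-separated adjacency matrix (assumed layout)."""
    rows = [
        [int(cell) for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if row  # skip blank lines
    ]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("adjacency matrix must be square")
    return rows


def edge_density(matrix):
    """Directed density (assumed definition): off-diagonal non-zeros / (n * (n - 1))."""
    n = len(matrix)
    if n < 2:
        return 0.0
    edges = sum(
        1
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j and value != 0
    )
    return edges / (n * (n - 1))


# Tiny made-up matrix for illustration only, not taken from the datasets.
sample = "0,1,0\n0,0,1\n1,1,0\n"
matrix = read_adjacency(sample)
print(len(matrix), edge_density(matrix))
```

For a real archive, the text of the extracted CSV file would be passed to `read_adjacency` in place of `sample`. If the file carries a header row or site labels in the first column, these would have to be stripped first.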