Data Mining and Modeling
(Analyze a structural data set)
Due 09-26-2005 at 8pm (~ 3 weeks).
The Network Analysis Toolkit will be used for the data analysis. Pajek will be used to provide simple network visualizations.
If you need help writing parsers that produce the supported input data file formats, please contact Weimao Ke <firstname.lastname@example.org>.
If you are interested in analyzing networks with more than 10,000 nodes (for which you can't use the Network Analysis Toolkit), please let me know ASAP.
If you encounter any trouble with the Network Analysis Toolkit, contact Bruce Herr <email@example.com>.
The project has the following parts:
(1) Selection of a structural data set. Feel free to use the data from Project 1.
(2) Getting summary statistics for the data set.
(3) Determine if network is scale free.
(4) Determine if network has small world properties.
(5) Plot the 'main' structure or the most interesting structural properties.
(6) Interpret your results and relate them to existing work.
All six parts are detailed below.
Selection of a network:
Identify a research question that is of interest to you and identify a structural data set that might help answer this question; see also Project 1.
DO NOT generate a subset of a network by using only the first 100 lines of a data file! If you need to sample, determine a set of highly connected nodes plus all nodes that are connected to those nodes via a path of n edges. Feel free to consult with me on how to make data sets more manageable while preserving their inherent structure.
Report where the data came from, i.e., give a reference to the respective book, paper, or web page. Provide links to the raw data in either of the two formats that are supported by the Network Analysis Toolkit. Add a short title and an explanation of each data set. Make sure a reader gets to know what the nodes, edges, and any weights or labels represent.
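The sampling rule above can be sketched in a few lines of Python. This is only an illustration, not part of the Network Analysis Toolkit: the function name `sample_neighborhood` and its parameters are hypothetical, the network is assumed to be undirected, and "a path of n edges" is read as "at most n edges", which is an assumption.

```python
from collections import deque


def sample_neighborhood(edges, num_seeds, n_hops):
    """Keep the num_seeds highest-degree nodes plus every node within
    n_hops edges of any of them (hypothetical helper, undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Seeds: the most highly connected nodes (ties broken by name).
    seeds = sorted(adj, key=lambda x: (-len(adj[x]), str(x)))[:num_seeds]

    # Multi-source breadth-first search, stopped after n_hops steps.
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == n_hops:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)

    kept = set(dist)
    # Induced subgraph: keep every edge whose endpoints both survive.
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e"), ("e", "f")]
kept, sub_edges = sample_neighborhood(edges, num_seeds=1, n_hops=1)
```

Unlike taking the first 100 lines of a file, this keeps the complete neighborhood around the seeds, so local structure such as hubs and their ties is preserved in the sample.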
Getting summary statistics for each data set:
For your dataset, determine the # Nodes, # Node types, # Edges, # Edge types, # Loops, Directed (y/n), Connected (y/n), Parallel edges (y/n), Density, Diameter, Characteristic path length, Char. path length for random network, Clustering Coefficient 1, Clustering Coefficient 2, Clustering Coefficient for random network, Average degree, Standard deviation, Variance, Scale free exponent, and R square.
Create a table that lists all these values. Please also save your values in this table so that we can easily compare the 12 data sets (analyzed by the 12 students taking this class). Provide a link to the completed table.
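If you want to cross-check some of the values reported by the toolkit, the following sketch computes several of them for a small, non-empty, undirected, unweighted network. It is an assumption-laden illustration: the function name `summary_statistics` is hypothetical, path statistics are taken over reachable pairs only, and the two clustering values shown (average local clustering and transitivity) are two common definitions; which of them the toolkit reports as Clustering Coefficient 1 and 2 is not stated here.

```python
import math
from collections import deque


def summary_statistics(nodes, edges):
    """Basic statistics for an undirected, unweighted network
    (hypothetical helper; parallel edges are collapsed)."""
    adj = {v: set() for v in nodes}
    loops = 0
    for u, v in edges:
        if u == v:
            loops += 1  # count self-loops, but keep them out of the degrees
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n = len(adj)
    degrees = [len(nb) for nb in adj.values()]
    m = sum(degrees) // 2
    mean_k = sum(degrees) / n
    variance = sum((k - mean_k) ** 2 for k in degrees) / n  # population variance
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    # Breadth-first search from every node: path lengths and diameter.
    total = pairs = diameter = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
        diameter = max(diameter, max(dist.values()))

    # Clustering: links among each node's neighbors vs. possible links.
    local, closed, triples = [], 0, 0
    for nb in adj.values():
        possible = len(nb) * (len(nb) - 1) // 2
        links = sum(len(adj[a] & nb) for a in nb) // 2
        local.append(links / possible if possible else 0.0)
        closed += links
        triples += possible

    return {
        "nodes": n, "edges": m, "loops": loops,
        "connected": pairs == n * (n - 1),
        "density": density, "diameter": diameter,
        "char_path_length": total / pairs if pairs else 0.0,
        "clustering_local_avg": sum(local) / n,
        "clustering_transitivity": closed / triples if triples else 0.0,
        "avg_degree": mean_k, "variance": variance,
        "std_dev": math.sqrt(variance),
    }


# A triangle a-b-c with a pendant node d and one self-loop on a.
stats = summary_statistics(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("a", "a")],
)
```

The all-pairs breadth-first search costs on the order of nodes times edges, so this sketch is only practical for small networks or samples.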
Determine if network is scale free.
Determine and plot the frequency f of the degree of connectivity k of a vertex. This function can be saved out via the Network Analysis Toolkit.
Provide a plot of P(k). State whether your network is scale free or not, and why. Please also add your result to the table.
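As a rough cross-check of the scale free exponent and R square values, the sketch below builds P(k) from a list of node degrees and fits a straight line to log P(k) versus log k by ordinary least squares. The function names are hypothetical, and whether the Network Analysis Toolkit estimates the exponent this way is an assumption. Nodes of degree 0 are dropped because log 0 is undefined.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {k: P(k)} for all degrees k > 0, sorted by k."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


def fit_power_law(pk):
    """Least-squares line through (log k, log P(k)).

    Returns (gamma, r_squared), where P(k) ~ k ** -gamma.
    Needs at least two distinct degrees.
    """
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared


# Synthetic degrees whose counts fall off exactly as k ** -2.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
pk = degree_distribution(degrees)
gamma, r_squared = fit_power_law(pk)
```

A straight line on the log-log plot with a high R square is consistent with a power law, but real degree distributions are noisy in the tail, so inspect the plot rather than relying on the fit alone.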
Determine if network has small world properties.
Determine if your network has small world properties given the values reported in (2).
Report and explain results. Please also add your result to the table.
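One common way to make this comparison is to set the measured characteristic path length and clustering coefficient against estimates for a random network with the same number of nodes N and average degree <k>. The sketch below uses the standard approximations L_random ≈ ln N / ln <k> and C_random ≈ <k> / N; the function names are hypothetical, and the example input values are made up purely for illustration.

```python
import math


def random_network_baseline(n, mean_degree):
    """Approximate path length and clustering of a random network
    with n nodes and the given average degree (requires mean_degree > 1)."""
    l_random = math.log(n) / math.log(mean_degree)
    c_random = mean_degree / n
    return l_random, c_random


def small_world_ratios(n, mean_degree, path_length, clustering):
    """Ratios of the measured values to the random-network estimates."""
    l_random, c_random = random_network_baseline(n, mean_degree)
    return {
        "l_random": l_random,
        "c_random": c_random,
        "path_length_ratio": path_length / l_random,
        "clustering_ratio": clustering / c_random,
    }


# Illustrative numbers only, not taken from any real data set.
ratios = small_world_ratios(n=1000, mean_degree=10,
                            path_length=3.6, clustering=0.3)
```

A network is usually described as having small world properties when its path length ratio is close to 1 while its clustering ratio is much larger than 1. How large is large enough is a judgment you should explain in your report.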
Plot the 'main' structure or the most interesting structural properties of the data set.
Use Pajek to visualize the structure of the data set.
Most data sets will be too large to be displayed without cluttering up the screen. Use threshold values to eliminate nodes whose degree is too low, show only 'strong ties', etc. Use node color coding and labeling to identify the most important nodes.
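If you prefer to prune the network before loading it into Pajek, the thresholding described above can be sketched as follows. The function names and the threshold parameters `min_weight` and `min_degree` are hypothetical, node labels are assumed to be strings, and the degree filter is applied in a single pass only. The output follows the basic Pajek .net layout of a `*Vertices` section followed by an `*Edges` section.

```python
def filter_network(weighted_edges, min_weight=1.0, min_degree=1):
    """Keep only 'strong ties' (weight >= min_weight) between nodes
    that have at least min_degree strong ties (single pass)."""
    strong = [(u, v, w) for u, v, w in weighted_edges
              if w >= min_weight and u != v]
    degree = {}
    for u, v, _ in strong:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    keep = {node for node, k in degree.items() if k >= min_degree}
    return [(u, v, w) for u, v, w in strong if u in keep and v in keep]


def to_pajek(weighted_edges):
    """Render an undirected weighted edge list as Pajek .net text."""
    labels = sorted({x for u, v, _ in weighted_edges for x in (u, v)})
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Edges")
    lines += [f"{index[u]} {index[v]} {w}" for u, v, w in weighted_edges]
    return "\n".join(lines) + "\n"


edges = [("a", "b", 5), ("b", "c", 1), ("a", "c", 4), ("c", "d", 3)]
pruned = filter_network(edges, min_weight=3, min_degree=2)
net_text = to_pajek(pruned)
```

Because removing low-degree nodes lowers the degree of their neighbors, you may want to repeat the filter until nothing changes. State the thresholds you used next to each visualization.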
Use Paint, Photoshop, Fireworks or any other drawing program to add node cluster boundaries and/or labels, use arrows, etc. to identify interesting features and parts of the data sets. Briefly discuss interesting structural properties.
On Handin web page:
Provide at least two visualizations of the data set. Clickable thumbnail images which lead to full resolution images are best.
Add a brief discussion of interesting structural properties.
Interpret your results and relate them to existing work.
Discuss the results you got. Do they answer your original research question?
Compare the scale free exponents and small world properties you determined with those reported in Tables I and II in http://arxiv.org/PS_cache/cond-mat/pdf/0106/0106096.pdf and in any other work that you find relevant, and discuss the differences.
Please provide complete references to all work you discuss. Please do provide links to papers if they are available online. For example, the paper linked above would be cited as:
Reka Albert and Albert-Laszlo Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, Volume 74, Number 1, January 2002, pp 47-97. Also published as cond-mat/0106096. Online available at http://arxiv.org/PS_cache/cond-mat/pdf/0106/0106096.pdf.
On Handin web page:
Add the discussion. Strive for quality not quantity. List complete references of all papers you discuss.
Check that your handin web page has a title, your name, and all the info listed under 'On Handin web page:'. Make sure the resulting web page (excluding higher resolution images of figures) prints on fewer than 4 pages. Hand in via http://ella.slis.indiana.edu/~katy/handin/L597-F05/cgi/handinlogin.cgi.