Fall 2005
Structural Data Mining and Modeling
Katy Börner
Project 4
(Data analysis and modeling)

Handout 4
October 18th, 2005


In this project you will phrase a scientific research question that is of interest to you (see also Project 1) and attempt to answer it. Research questions come in diverse formats: you can (1) test hypotheses derived from grand theory, (2) investigate the relationships among a set of variables and tell a story (i.e., construct a theory) based on the results, or (3) use structural data mining and modeling to diagnose a problem and prescribe a solution based on the diagnosis. This project differs from previous projects in that you will most likely work with a faculty/staff ‘client’ and write up the results as a research paper. Teamwork will allow you to take on larger projects. However, please feel free to work on this project by yourself.

This project has the following parts/deadlines:
(1) Define your project. Handin due Mon 10-24-2005, 8pm (in one week).
(2) Data analysis and visualization. Give a 10-minute presentation of your results and an outlook on your data modeling plans in Lab 10 to get feedback from everybody: Wed 11-01-2005 during Lab (two weeks later).
(3) Validation.
(4) Documentation and discussion of data analysis & visualization as well as validation results. Handin as 'Project 4-a&v' by Mon 11-07-2005, 8pm.
(5) Data modeling.
(6) Documentation of final project results. Handin as 'Project 4' by Mon 11-28-2005 at 8pm (2 weeks & Thanksgiving later).
(7) Project/Peer rating. Email due Mon 11-28-2005 at 8pm.
(8) Final presentation during Lecture and Lab on 12-06-2005.
All parts are detailed subsequently.
Hand in all work except the rating via http://ella.slis.indiana.edu/~katy/handin/L597-F04/cgi/handinlogin.cgi.

The IVC, UCINET 6, and customized Perl scripts will be used for the data analysis. Pajek will be used to provide simple network visualizations. Feel free to use VxInsight and Chizu as well. General Perl parsers are at http://ella.slis.indiana.edu/~kmane/katy/katy_l597/perl_scripts/.

Important Notes:
If you need help writing parsers to produce the supported input data file formats, please contact Weimao Ke <wke@indiana.edu>.
If you are interested in analyzing large-scale data sets (for which you can't use the Network Analysis Toolkit), please let me know ASAP.
If you encounter any trouble with the Network Analysis Toolkit, contact Shashi Penumarthy <sprao@indiana.edu>.

(1) Define your project - Due Mon 10-24-2005, 8pm
First, you will need to define a project that you would like to work on. Some possibilities are given below. However, feel free to select your very own project.
Within a week you will need to define the scope of the project, e.g., the data sets used, the data analysis to be performed and the algorithms needed for it, and initial ideas on the data modeling part (we will learn more about data modeling in classes 8-11).

Project possibilities:

It is highly recommended that you have expertise in the area you are analyzing/modeling. That is, if you select social network research, make sure you have knowledge in this field or know a social network researcher who is interested and available to serve as a 'client' in this project and who can help you validate/interpret/optimize the data analysis results, visualizations, and simple data models.

On the Handin web page provide information on your project such as

The result should fit on 1-2 pages on a letter size printout.

(2) Data analysis and visualization
Typically, you will determine summary statistics as well as network properties for the network(s) under consideration.  Please, please do feel free to explore other ways to analyze your data set(s). Often, it is a matter of playing with the data for a while - getting your hands dirty - that gives the best results.  Be creative. Feel free to send me your ideas for comments.
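As a starting point for the summary statistics mentioned above, here is a minimal sketch in Python that works on a plain edge list. It is only an illustration and not part of the course toolkit: the function name `summarize` and the sample edge list are made up, and the network is assumed to be undirected and unweighted.

```python
from collections import Counter

def summarize(edges):
    """Return basic statistics for an undirected, unweighted edge list.

    Nodes that never appear in an edge (isolates) are not counted,
    because an edge list carries no information about them.
    """
    # Store each edge once, ignoring direction, duplicates, and self-loops.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = Counter()
    for edge in unique:
        for node in edge:
            degree[node] += 1
    n, m = len(degree), len(unique)
    return {
        "nodes": n,
        "edges": m,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_degree": 2 * m / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
        # Maps a degree value k to the number of nodes with that degree.
        "degree_histogram": dict(sorted(Counter(degree.values()).items())),
    }

# Hypothetical edge list; ("D", "C") duplicates ("C", "D") and is merged.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "C")]
stats = summarize(edges)
for name, value in stats.items():
    print(name, value)
```

The degree histogram is usually the first thing worth plotting, since it feeds directly into the comparison with related work in part (3).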
In addition, please plot, examine, and discuss the most interesting structural properties of your network(s). Larger data sets cannot be displayed in full without cluttering the screen. Use threshold values or Pathfinder Network Scaling to eliminate nodes whose degree is too low, show only 'strong ties', etc. Use node/edge color/size coding and labeling to identify the most important nodes/links. Use Paint, Photoshop, Fireworks, or any other drawing program to add node cluster boundaries and/or labels, arrows, etc. to identify interesting features and parts of the data sets.
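The threshold approach described above can be sketched as follows. This is a simple illustration under stated assumptions, not the Pathfinder Network Scaling algorithm: the function `threshold_network`, its parameters `min_weight` and `min_degree`, and the sample data are all hypothetical, and edges are assumed to be given as (node, node, weight) triples.

```python
def threshold_network(weighted_edges, min_weight=1, min_degree=1):
    """Keep edges with weight >= min_weight, then drop low-degree nodes.

    Removing a node can lower the degree of its neighbours, so the
    degree filter is repeated until no further edges are removed.
    """
    strong = [(u, v, w) for u, v, w in weighted_edges if w >= min_weight]
    while True:
        degree = {}
        for u, v, _ in strong:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        kept = [(u, v, w) for u, v, w in strong
                if degree[u] >= min_degree and degree[v] >= min_degree]
        if len(kept) == len(strong):
            return kept
        strong = kept

# Hypothetical weighted edge list, e.g. number of co-authored papers.
weighted_edges = [
    ("A", "B", 5), ("A", "C", 3), ("B", "C", 4),
    ("C", "D", 1), ("D", "E", 6),
]
core = threshold_network(weighted_edges, min_weight=3, min_degree=2)
print(core)
```

With these settings the weak tie C-D is dropped first, after which D and E each have degree 1 and are removed as well, leaving the triangle A-B-C. Whatever thresholds you choose, report them so that others can reproduce the reduced network.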

(3) Validation
Validate the data analysis results/visualizations by having experts comment on them.  Do they agree on the importance/dominance of certain nodes? Do they confirm the structure of this network - major links, number of clusters, cluster size, etc.? What do they miss? Use at least two experts.
In addition, validate your results by comparing them with related work. For example, compare and discuss the gamma values and small world properties you extracted with the data given in Tables I and II in http://arxiv.org/PS_cache/cond-mat/pdf/0106/0106096.pdf. If applicable, compare your map to maps of related data sets.
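To compare gamma values and small world properties with published figures, you first need to compute them for your own network. The sketch below is one possible way to do so in plain Python and rests on several assumptions: all function names are hypothetical, the network is undirected, path lengths are averaged only over pairs that are connected, the random-graph baseline uses the common approximations C = <k>/N and L = ln N / ln <k>, and gamma is estimated with the continuous maximum-likelihood formula, which is only a rough approximation for integer degrees.

```python
import math
from collections import deque

def build_adjacency(edges):
    """Adjacency sets for an undirected edge list (self-loops ignored)."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            # Each link among the neighbours is seen from both of its ends.
            twice_links = sum(len(adj[a] & nbrs) for a in nbrs)
            total += twice_links / (k * (k - 1))
    return total / len(adj) if adj else 0.0

def average_path_length(adj):
    """Mean shortest-path length over connected node pairs, using BFS."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def random_graph_baseline(n, mean_degree):
    """Approximate (clustering, path length) of a comparable random graph."""
    if n < 2 or mean_degree <= 1:
        return float("nan"), float("nan")
    return mean_degree / n, math.log(n) / math.log(mean_degree)

def estimate_gamma(degrees, k_min=1):
    """Continuous MLE of gamma for P(k) ~ k^-gamma over degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    return 1 + len(tail) / log_sum if log_sum > 0 else float("nan")

# Hypothetical toy network: a triangle A-B-C with a pendant node D.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
degrees = [len(nbrs) for nbrs in adj.values()]
c = average_clustering(adj)
l = average_path_length(adj)
c_rand, l_rand = random_graph_baseline(len(adj), sum(degrees) / len(adj))
gamma = estimate_gamma(degrees)
print(c, l, c_rand, l_rand, gamma)
```

A network is usually called small world when its clustering is much higher than the random baseline while its path length stays close to it. The toy network is far too small for the gamma estimate to mean anything; on real data, also check a log-log plot of the degree distribution before reporting a value.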

(4) Documentation and discussion of data analysis & visualization as well as validation results - Due 11-07-2005, 8pm.
Explain in detail the data analysis you performed. Provide links to initial, intermediate, and final result data files. Make sure that somebody else who reads this explanation is able to obtain the very same results you got. Provide several views of your data set(s). Clickable thumbnail images that lead to full resolution images are best. Explain how you generated the visualizations. Add a brief discussion of interesting structural properties. Last but not least, document your validation results. Altogether, this should be no more than 4 pages on a letter size printout.

(5) Data modeling
Design a simple process model for the data set(s) you analyzed. Ideally, the model captures the elementary mechanisms that lead to the production of the structural data set under consideration and the networks produced by the model conform to the measured data in terms of simple statistics and network properties. We will discuss data modeling in class 8-11 and this part of the final project description will be detailed then.
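To make the idea of a process model concrete, here is a minimal sketch of one well-known growth mechanism, preferential attachment, in which new nodes link preferentially to nodes that already have many links. It is offered only as an example of what such a model can look like and is not necessarily the model discussed in class. The function name `preferential_attachment` and its parameters `n_nodes`, `m_links`, and `seed` are hypothetical.

```python
import random

def preferential_attachment(n_nodes, m_links, seed=None):
    """Grow an undirected network node by node.

    Each new node links to m_links distinct existing nodes, chosen with
    probability proportional to their current degree.
    """
    rng = random.Random(seed)
    # Start from a small fully connected core of m_links + 1 nodes.
    edges = [(i, j) for i in range(m_links + 1) for j in range(i)]
    # Every node appears here once per edge end, so a uniform draw from
    # this list is a draw proportional to degree.
    ends = [node for edge in edges for node in edge]
    for new in range(m_links + 1, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(ends))
        for target in sorted(targets):
            edges.append((new, target))
            ends.extend((new, target))
    return edges

# Example run: 50 nodes, 2 links per new node, fixed seed for repeatability.
simulated_edges = preferential_attachment(50, 2, seed=1)
print(len(simulated_edges), "edges")
```

The model has two parameters besides the seed, so the 'influence of model parameters' can be studied by varying `m_links` and `n_nodes` and recomputing the same statistics you measured on the real data.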

(6) Documentation of final project results - Due 11-28-2005 at 8pm
Update your data analysis, visualization, and validation results if needed. Add the data modeling results, comprising a detailed motivation and description of your model, properties of simulated data sets, the influence of model parameters (if any), the correlation of simulated and real data, and a discussion of related modeling work. Add a section on 'Challenges and Opportunities' in which you may want to address complexity & scaling issues, etc., as well as desirable modifications & extensions of your work. (You have only 6 weeks for this project, which considerably restricts the amount of work that can be done. However, I would like to know what promising avenues you see for the project.) Don't forget to include an 'Acknowledgment' section in which you thank people who provided resources or helped you with this project. Please list complete references to all work you discuss. Do provide links to papers if they are available online.
The final documentation should be no longer than 6 pages and conform to the format specified in the template available at http://www.ils.unc.edu/jcdl2002/cfp.doc.
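For the 'correlation of simulated and real data' item above, one simple option is to correlate the degree distributions of the two networks. The sketch below shows this under stated assumptions: the function names and the two degree lists are made up for illustration, and Pearson correlation of degree distributions is only one of several reasonable comparison measures.

```python
import math
from collections import Counter

def degree_distribution(degrees, max_degree):
    """Fraction of nodes with each degree 0..max_degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return [counts.get(k, 0) / n for k in range(max_degree + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Undefined when one sequence is constant.
    return cov / (sd_x * sd_y) if sd_x and sd_y else float("nan")

# Hypothetical node degrees from the measured and the simulated network.
real_degrees = [1, 1, 1, 2, 2, 3, 4]
simulated_degrees = [1, 1, 2, 2, 2, 3, 5]

k_max = max(real_degrees + simulated_degrees)
r = pearson(degree_distribution(real_degrees, k_max),
            degree_distribution(simulated_degrees, k_max))
print("correlation of degree distributions:", round(r, 3))
```

A high correlation on one statistic does not show that the model is right, so it is worth repeating the comparison for other properties such as clustering and path length.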

Link the final paper from a web page. Feel free to add links to supplementary material such as raw and compiled data, code, simulated data, screen shots, animations of simulation runs, etc. as appropriate. Please keep in mind that I will grade the paper, not the supplementary material. However, the latter will make it easier to re-run the model with different parameters, help to 'see' the effects of model parameters, provide an excellent resource for you and others to follow up on this work, etc.

Hand in the link to the web page and place a complete copy of your handout in Katy's SLIS mailbox.

(7) Project/Peer Rating - Due 11-28-2005 at 8pm
Please rate on a scale of 1 (lowest) to 20 (highest) the following factors related to your group project.

Your input will remain confidential, so please be honest in your appraisals. The rating will influence the class participation (20% of the total grade for the course) as well as the peer review portion of the final project grade (5 points out of 40 points) and it may even influence individual scores for the final project.
Please send your rating via mail to katy@indiana.edu.

(8) Final presentation
Present the results of your project during Lecture and Lab on 12-06-2005.

Last modified: 10/18/2005