Fall 2005
Structural Data Mining and Modeling
Katy Börner
Project 3
(Analyze and visualize a structural data set)
Handout 3
September 27, 2005

Due Monday 10-17-2005 at 8pm (~3 weeks)
Prepare to give a 5 min team presentation of the data set(s) you selected and the planned data analysis in Lab 6 on 10-4-2005.

In this project you will work in teams of two to analyze and visualize a structural data set. In particular, you will analyze and visualize the structure and/or dynamics of a knowledge domain of your choice. For a recent review of this line of research consult:

Börner, Katy, Chen, Chaomei, and Boyack, Kevin. (2003). Visualizing Knowledge Domains. In Blaise Cronin (Ed.), Annual Review of Information Science & Technology, Volume 37, Medford, NJ: Information Today, Inc./American Society for Information Science and Technology, chapter 5, pp. 179-255.

Please work in self-selected teams of two people.
The Network Analysis Toolkit, UCINET 6, and customized Perl scripts will be used for the data analysis. Pajek and HistCite will be used to provide simple network visualizations.

Important Notes:
If you need help writing parsers to produce the supported input data file formats, please contact Weimao Ke <wke@indiana.edu>.
If you are interested in analyzing large scale data sets (for which you can't use the Network Analysis Toolkit), please let me know ASAP.

This project has the following parts:
(1) Selection of one or more citation data set(s).
(2) Decide which network to analyze & visualize.
(3) Get summary statistics and determine network properties for those network(s).
(4) Data analysis.
(5) Data visualization.
(6) Validation.
(7) Discussion of results.
All seven parts are detailed below.

(1) Selection of one or more citation data set(s)
Select one of the two given data sets:

or download a data set of your choice from ISI Web of Knowledge.

Note: It is highly recommended that you have expertise in the data you are mapping. That is, if you select social network research make sure you have knowledge in this field or you know a social network researcher who is interested and available to help you validate/interpret/optimize the data analysis results and visualizations.

To download your own data set, connect to http://isiknowledge.com and select 'Web of Science'. On the next screen, select 'General Search', 'Cited Reference Search', or 'Advanced Search' and select the citation database and timespan you are interested in. On the next screen, enter search terms, then click 'Search'. On the next screen, select papers of interest. Search for cited work and/or work that cites these papers. Mark desired papers. ... here comes some iterative, time consuming work ... Click the 'Marked List' button (top menu). On the next screen, select the fields to include in the output. Make sure you check 'cited references' and 'times cited'. Click the 'Save to file' button to download your marked papers to the desktop. When the 'File Download' window appears, click 'Save file to disk'. Rename the .txt file and save it. To verify that the saved .txt file is complete, check that it ends with an 'ER' tag. If it does not, you must redo the download: WoS times out after a few minutes, so reduce the number of records on the marked list until you are able to capture them all. See also the ISI Basics help file.
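The completeness check described above can be automated. The following is a minimal Python sketch, not part of the course toolkit. The function names are hypothetical, and tolerating a trailing 'EF' line after the last record is an assumption about how some exports end.

```python
def ends_with_er(text):
    """Return True if the last record of an ISI export is closed by an 'ER' tag."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: an 'EF' (end-of-file) line may follow the final record.
    if lines and lines[-1] == "EF":
        lines.pop()
    return bool(lines) and lines[-1] == "ER"


def count_records(text):
    """Count records by counting the 'ER' lines that close them."""
    return sum(1 for ln in text.splitlines() if ln.strip() == "ER")
```

Comparing count_records(...) with the number of papers on your marked list is a quick way to spot a download that was cut off between records.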

On Handin web page:
Provide information on your data set. If you compile your own data set, explain in detail what search terms you used and how you conducted the search. Should not be more than 1/2 page on a letter size printout.
Please provide links to the raw, ISI formatted data as well as to any files in UCINET DL format or Pajek .net format that you use.

(2) Decide which network to analyze & visualize
Citation data provide information about co-authorship networks, paper-citation networks, and the diffusion of knowledge, both directly via co-authorships and indirectly via the internalization of other authors’ papers. While co-authorship networks and paper-citation networks are typically considered separately, the study of knowledge diffusion requires analyzing the ecology of co-author and paper-citation networks together.
You may like to quickly probe which networks, analyses, and visualizations yield the most interesting results. Very likely, nobody has ever generated a map of this data set before. Hence, you will be the ones to discover the structural and dynamic properties of this data set/topic area.
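As an illustration of how a co-authorship network can be derived from citation records, here is a minimal Python sketch. It assumes a tagged export in which each field starts with a two-letter tag, continuation lines are indented, authors are listed under 'AU' (one per line), and each record ends with 'ER'. The function names are hypothetical; check the layout against your own file before relying on it.

```python
from collections import Counter
from itertools import combinations


def parse_authors(text):
    """Return one author list per record of a tagged ISI-style export."""
    records, authors, tag = [], [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:2].strip():      # a new two-letter field tag starts here
            tag = line[:2]
            value = line[3:].strip()
        else:                     # indented continuation of the previous field
            value = line.strip()
        if tag == "AU" and value:
            authors.append(value)
        elif tag == "ER":         # end of record: store and reset
            records.append(authors)
            authors = []
    return records


def coauthor_edges(records):
    """Count how often each unordered author pair shares a paper."""
    edges = Counter()
    for authors in records:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges
```

The edge weight is the number of jointly authored papers. Author names are matched as exact strings, so spelling variants of the same person would need to be unified first.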

Some common choices would be:

Note that only the first and the third choice provide the opportunity to combine link and content information.

On Handin web page:
Explain your decision. Should not be more than 1/2 page on a letter size printout.

(3) Get summary statistics and determine network properties for those network(s)
For each data set determine the number of nodes, number of edges, and other network properties you find relevant.
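A minimal Python sketch of such summary statistics follows, assuming an undirected network given as a list of node pairs. The function name and the choice of statistics are illustrative only; UCINET and the Network Analysis Toolkit report these and many more.

```python
def summary_stats(edges):
    """Basic statistics for an undirected, simple network given as node pairs."""
    # Drop self-loops and treat (a, b) and (b, a) as the same edge.
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    nodes = {n for e in unique for n in e}
    n, m = len(nodes), len(unique)
    return {
        "nodes": n,
        "edges": m,
        "mean_degree": 2 * m / n if n else 0.0,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }
```

Nodes without any edge never appear in an edge list, so isolated authors or papers have to be counted separately.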

On Handin web page:
Provide info (table is best) with summary statistics and network properties. Should not be more than 1/2 page on a letter size printout.

(4) Data analysis
This is the most important part. Without a good data analysis you won't be able to generate a meaningful network layout or get positive feedback in the validation part.

Some common data analysis choices (of similar difficulty) would be:

Please, please do feel free to think about other meaningful ways to analyze your data sets. Often, it is a matter of playing with the data for a while - getting your hands dirty - that gives the best results. Be creative. Feel free to send me your ideas for comments.

Note: All teams will present the data set(s) they selected together with their plans for the data analysis in Lab 6 on 10-4-2005. This will provide an opportunity to give constructive feedback, pointers to relevant work, etc.

On Handin web page:
Explain in detail the data analysis you performed. Provide links to main (intermediate) result data files. Make sure that if somebody else reads this explanation s/he is able to obtain the very same results you got. Should not be more than 1 page on a letter size printout.

(5) Data visualization
Plot the 'main' structure or the most interesting structural properties of your network(s).
Use Pajek or UCINET dendrograms (Tools > Dendrogram > Draw) to visualize the structure of the data sets.
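If your network lives in a script rather than in one of the supported tools, it has to be written out in a format Pajek can read. The sketch below is a minimal Python example that serializes an undirected, weighted edge list as Pajek .net text. The function name is hypothetical, and only the basic '*Vertices' and '*Edges' sections are produced.

```python
def to_pajek(weighted_edges):
    """Serialize {(node_a, node_b): weight} as Pajek .net text."""
    nodes = sorted({n for e in weighted_edges for n in e})
    index = {n: i for i, n in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[n]} "{n}"' for n in nodes]
    lines.append("*Edges")
    for (a, b), w in sorted(weighted_edges.items()):
        lines.append(f"{index[a]} {index[b]} {w}")
    return "\n".join(lines) + "\n"
```

Labels are wrapped in double quotes, so a label that itself contains a double quote would need to be cleaned beforehand.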

Most data sets will be too large to be displayed without cluttering the screen. Use threshold values, betweenness centrality clustering, or Pathfinder Network Scaling to eliminate nodes whose degree is too low, show only 'strong ties', etc. Use node color coding and labeling to identify the most important nodes.
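The simplest of these reductions, thresholding, can be sketched in a few lines of Python. The example assumes a dictionary that maps node pairs to edge weights. The function name and both threshold parameters are hypothetical, and suitable values depend entirely on your data. It does not implement betweenness centrality clustering or Pathfinder Network Scaling.

```python
def filter_network(weighted_edges, min_weight=2, min_degree=1):
    """Keep strong ties, then drop nodes whose remaining degree is too low."""
    # Step 1: keep only edges at or above the weight threshold.
    strong = {e: w for e, w in weighted_edges.items() if w >= min_weight}
    # Step 2: compute node degrees in the thresholded network.
    degree = {}
    for a, b in strong:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Step 3: keep edges whose endpoints both meet the degree threshold.
    keep = {n for n, d in degree.items() if d >= min_degree}
    return {e: w for e, w in strong.items() if e[0] in keep and e[1] in keep}
```

The degree filter is applied once. Removing nodes can lower the degree of their neighbors, so repeat the call if every remaining node must satisfy the threshold.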

Use Paint, Photoshop, Fireworks or any other drawing program to add node cluster boundaries and/or labels, use arrows, etc. to identify interesting features and parts of the data sets. Briefly discuss interesting structural properties.

On Handin web page:
Provide several views of your data set(s). Clickable thumbnail images which lead to full resolution images are best.
Explain how you generated the visualizations. Add a brief discussion of interesting structural properties. Should not be more than 1-2 pages (figures take up most of the space) on a letter size printout.

(6) Validation
Validate the generated visualizations by having experts comment on them. Do they agree on the importance/dominance of certain nodes (here words, papers, or authors)? Do they confirm the structure of this network - major links, number of clusters, cluster size, etc.? What do they miss? Use at least two experts.

On Handin web page:
Add the validation. It should not be more than 1 page on a letter size printout.

(7) Discussion of results
Discuss major data analysis, visualization, and validation results.

Relate your results to existing work. For example, compare and discuss power law exponent values and small world properties you extracted with the data given in Table I and II in http://arxiv.org/PS_cache/cond-mat/pdf/0106/0106096.pdf. If applicable, compare your map to other maps of this domain.
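To compare against published exponents you first need an estimate from your own degree sequence. The sketch below is one possible approach, not necessarily the procedure used in the cited review: it applies the maximum-likelihood estimator for a continuous power law, P(k) proportional to k^(-gamma), to all degrees at or above a cutoff. The function name and the k_min parameter are hypothetical. The continuous approximation is biased for small integer degrees, so treat the result as a rough value.

```python
import math


def powerlaw_exponent(degrees, k_min=1):
    """Estimate gamma in P(k) ~ k^-gamma via the continuous maximum-likelihood formula."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need degrees above k_min to estimate an exponent")
    # gamma = 1 + n / sum(ln(k_i / k_min)) over the n degrees in the tail
    return 1 + len(tail) / log_sum
```

Try several values of k_min and report the one you used, since the estimate can change noticeably with the cutoff.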

Please provide complete references to all work you discuss. Please also provide links to papers if they are available online. For example, the paper linked above would be cited as:

Reka Albert and Albert-Laszlo Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, Volume 74, Number 1, January 2002, pp 47-97. Also published as cond-mat/0106096. Available online at http://www.nd.edu/~networks/PDF/rmp.pdf.

On Handin web page:
Add the discussion - it should not be more than 1 page on a letter size printout. List complete references of all papers you discuss.

Handing in
Check that your handin web page has a title, your names, and all the info listed under 'On Handin web page'. Please strive for quality not quantity and make sure that there is an easy way to print your project result on 6 pages or fewer (11pt or larger). Hand in via http://ella.slis.indiana.edu/~katy/handin/L597-F05/cgi/handinlogin.cgi.

Send mail to katy@indiana.edu with questions or comments about this web site.
Last modified: 9/21/2005