L597: Structural Data Mining and Modeling (Fall 2005)

**Course Description | Grade
| Course Outline | Resources**

**Lecture: **Tue Tue
1p-2:15p & 2:30p-3:45p, LI 001

**Office hours: **Mon 3-4p,
LI 019 (Katy), Fri 2-3p, LI 020 (Weimao)

**Prerequisites: **L401
or equivalent

**Class Listserver: **katy_L597-L@listserv.indiana.edu

**Class Web Page: **http://ella.slis.indiana.edu/~katy/L597-Fall05

**Project Handin Webpage: **http://ella.slis.indiana.edu/~katy/handin/L597-F05/cgi/handinlogin.cgi

**Web Resources:** MS Excel
Primer, Pajek tutorial,
Perl
Scripts

**Recorded Talks:** Information Processing Talks 2002-2003,
Sackler Colloquium on Mapping Knowledge Domains Talks, Digital Biology Talks,
Complexity Digest 2000.31,

**Talk Series on Networks
and Complex Systems**, Spring
05, Fall 04.

This course introduces students to major methods, theories, and applications of structural data mining and modeling. It covers elementary graph theory and matrix algebra, data collection, structural data mining, data modeling, and applications.

Students will learn how to frame the research question, collect the data, run the analysis, and interpret the results. In addition, they will learn how to design and evaluate descriptive and predictive models of diverse complex networks to improve their understanding of the underlying principles.

Upon taking this course students will be able to analyze and describe real networks (power grids, WWW, social networks, etc.) as well as relevant phenomena such as disease propagation, search, organizational performance, social power, and the diffusion of innovations.

**Course Format**

A combination of lectures and 4-5
labs will be utilized to introduce material from such diverse domains as mathematics
(graph theory, matrix algebra, chaos theory), computer science (data mining,
knowledge discovery, structural pattern matching, complex network theory), information
science (bibliometrics, webometrics, social network analysis), physics (nonlinear
dynamics, statistical physics), biology (genomics, food webs), etc. MS Excel,
Matlab and Pajek will be employed for data mining and modeling. Students will
improve their collaborative skills by working in topic dependent teams.

Individual and group work will be evaluated according to how well the course material is understood and implemented into projects, as well as the quality of written and oral presentations. Students are expected to spent about 8h per week outside of class for readings and projects.

Students will be required to do weekly readings of journal articles and participate in class discussions. Class participation is evaluated on the frequency of relevant, constructive contributions that reflect a close reading of assigned materials and thoughtful reflection on the topic during class as well as via the class majordomo list and handin web pages.

The 15 min ** presentation**
will address a specific topic/question and will be based on readings
from the literature or internet. Sources will be provided. If you can
find
more that's great. See Preparation of
Presentations for more details.

You are expected to use the office hours the week before you will give the presentation to discuss your preparation with the instructor. Prepare your presentation as well as any specific questions you may have in advance.

All students will be expected to participate in class by

Diverse small projects are designed to create competence in using taught methods and the software effectively. A final research project (which may be done in groups and in collaboration with clients) is due at the end of the semester. Detailed descriptions of projects will be provided. Results have to be submitted via the handin web page.

The final grade breakdown will be as follows: class participation (20%), presentation of selected readings (10%), small projects (30%), and final project (40%). Grades are assigned according to the grading standards of SLIS.

Make sure you handin in time and your handin is accessible and readable!

**Credits:** 3 for L597

The class schedule may change as the course progresses. Changes will be posted on the L597 majordomo list.

**Introduction**

**Class 1 (08-30-2005)**
**[dm-class1.ppt]**

**Lecture:** Course
Description & Outline, Class Format, Grades, Resources

Introduction to Structural Data
Mining and Modeling.

Structural Data Collection (the importance of high quality data) &
Data Analysis Basics

**Project
1:** **Personal Web page** that tells about
you, your expertise, and your expectations on this course. *It will be used
to adjust course materials to the programming skills and (departmenal) background
of registered students and to setup the handin web pages. In addition, you will
be asked to phrase a scientific question that network science and/or complex
systems approaches can help answer; select an appropriate dataset; and discuss
how this question might be solved.*

**Due 09-05-2005
at 8pm (~1 week).**

**Class 2 (09-06-2005)**
**[dm-class2.ppt,
dm-lab1.ppt]**

**Lecture:** Introduction to
Graph Theory

**Readings: **Strogatz
S. H. Exploring complex networks. Nature 410, 268-276
(2001).

Hayes (2000) Graph theory in practice: Part
I and Part II, American Scientist, Volume 88,
No. 2, 9-13 and 104-109.

**Lab 1 in Shepherd Lab,
Main Library 030:** Data Analysis Basics - Introduction to the IVC
Software Framework** **

**Project
2: Analyze a structural data set.**

**Due 09-26-2005
at 8pm (~ 3 weeks).**

**Class 3 (09-13-2005)**
**[dm-class3.ppt,
dm-lab2.ppt]**

**Lecture:** Matrix Algebra
Primer

**Readings:** Andrei Broder,
Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie
Stata, Andrew Tomkins & Janet Wiener (2002) Graph structure in the web: Experiments
and models. Computer Networks, 33, pp. 309-320.

**Lab 2 in Shepherd Lab,
Main Library 030:** Network Analysis
Toolkit & Pajek.

**Structural Data
Mining**

**Class 4 (9-20-2005)**
**[dm-class4.ppt,
dm-lab3.ppt]**

**Lecture:** Data Mining vs.
Structural Data Mining

**Reading:**

Kristin P. Bennett &
Colin Campbell (2000) Support Vector Machines: Hype or
Hallelujah? SIGKDD Explorations, 2, 2, pp. 1-13.

J.
Kleinberg (1998) Authoritative sources
in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms.
Extended version in Journal of the ACM 46 (1999). Also appeared as IBM Research
Report RJ 10076, May 1997.

Listening: Talks by Monika
Henzinger & Steve Lawrence

**Presentation: **
Pedro Domingos, Matt Richardson.
The Intelligent Surfer: Probabilistic Combination of
Link and Content Information in PageRank. Advances in Neural Information
Processing Systems 14, 2002.

F.
Menczer (2002) Growing and Navigating
the Small World Web by Local Content. Proc. Natl. Acad. Sci. USA 99(22):
14014-14019.**
[John Burgoon]**

**Lab 3 in Shepherd Lab,
Main Library 030:** UCINET

**Class 5: (9-27-2005)**
**[dm-class5.ppt,
dm-lab4.ppt]**

**Lecture: **Random Networks
vs. Small World Networks vs. Scale Free Networks, Power Laws in Graphs

**Reading: **Watts D. J. and
Strogatz S. H. Collective dynamics of
'small-world' networks. Nature 393, 440-442 (1998).

**Presentations:** Reka Albert,
Albert-Laszlo Barabasi (2002). Statistical mechanics of complex
networks. cond-mat/0106096:

Small World Networks **[Todd
Holloway] **

Scale Free Model

**Project
3: Analyze and visualize a scholarly data set. **

**Due 10-17-2005
at 8pm (~ 3 weeks).**

**10-03-2005
InfoVis
Lab Open House, 4-6p, Lilly Library, IUB. **

**Class 6: (10-04-2005)**
**[dm-class6.ppt]**

**Lecture:** Structural Similarity
Measures

**Reading:** Goldsmith, T. E.
& Davenport, D. M. (1990). Assessing structural
similarity of graphs. In R. W. Schvaneveldt (Ed.), Pathfinder associative
networks: Studies in knowledge organization (pp. 75-87), Norwood, NJ: Ablex.
**Surf: **References
for Graph Similarity

**Presentation: **Krumhansl
(1978) Concerning the applicability of geometric models
to similarity data: The interrelationship between similarity and spatial density.
Psychological Review, 85, 445-463.

Pruzansky, Tversky & Carroll
(1982) Spatial versus tree representations of proximity
data. Psychometrika, 47, 3-24.** [****Vsevolod
Kapatsinski]**

**Lab 5 in Shepherd Lab,
Main Library 030:** VxOrd and VxInsight
and other tools compiled by John Burgoon.

**Class 7: (10-11-2005)**
**[dm-class7.ppt]**

**Lecture: **Dimensionality
Reduction Techniques & Clustering Techniques

**Readings:** R. N. Shepard.
Multidimensional scaling, tree fitting, and clustering.
Science, 210:390-398 (1980).

**Presentation:** C.T. Zahn.
Graph-Theoretical Methods for Detecting and Describing Gestalt
Clusters. IEEE Transactions on Computers, 20:68-86, 1971.**
[Gavin
LaRowe] **Hands-on
session: Coupling network analysis and visualization.

Lab in 001:

**Descriptive and
Predictive Data Modeling**

**Class 8: (10-18-2005)****
** **[dm-class8.ppt]**

Descriptive Modeling Techniques
vs. Predictive/Generative Modeling Techniques

**Readings:** Dorogovtsev, S. N. & Mendes, J. F.
F. (2002). Evolution of networks.
cond-mat/0106144

Reka Albert, Albert-Laszlo Barabasi
(2002). Statistical mechanics of complex
networks. cond-mat/0106096: Evolving Networks

**Final
Project 4: Data analysis and modeling. The
project can (1) test hypotheses derived from grand theory, (2) investigate the
relationships among a set of variables, and tell a story (i.e., construct a
theory) based on the results, or (3) use structural data mining and modeling
to diagnose a problem and prescribe a solution based on the diagnosis.**

**Due 11-28-2005
at 8pm.**

**Class 9: (10-25-2005)**
**
** **[dm-class9.ppt]**

**Guest Lecture by Larry Yaeger**

**Lecture:** Complex Adaptive
Systems.

**Reading:
**Ball, P. (1999). The self-made tapestry. Oxford, England: Oxford University
Press, chapter on Communities.

**Presentation:**
Robert
Axelrod (1998) Disseminating Culture - The Dissemination
of Culture: A Model with Local Convergence and Global Polarization. In The
Complexity of Cooperation: Agent-Based Models of Competition and Collaboration.
Princeton University Press, chapter 7.

Robert Axelrod (1997) Replication of Agent Based Models. In The Complexity of
Cooperation. Princeton University Press. **[Adity
Upoma]**

**Network Dynamics**

**Class 10: (11-01-2005)
[dm-class10.ppt]**

**Class 11: (11-8-2005)**

**Lecture: **Modeling
Diffusion in Evolving Network Ecologies. **[dm-class11.ppt]**

**Reading:**
Peter Sheridan Dodds, Duncan J. Watts, and Charles F. Sabel (2003). Information
exchange and the robustness of organizational networks. PNAS, vol. 100,
12516-12521.

**Presentation
1:** Torsten Hagerstrand. A Monte Carlo Approach to Diffusion. In Spatial Analysis
- A Reader in Statistical Geography. Brian J.L. Berry & Duane F. Marble
(eds.), Prentice Hall, pp.368-384. **[Justin
Donaldson]**

**Presentation 2:**
Romualdo Pastor-Satorras and Alessandro Vespignani (2003). Epidemics
and Immunization in Scale Free Networks. In Handbook of Graphs and Networks:
From the Genome to the Internet. Stefan Bornholdt and Hans Georg Schuster (Eds.),
Wiley-VCH, pp. 111-130. **[Bryan
Hasty]**

**Selected Applications**

**Class 12: (11-15-2005)**
**[dm-class12.ppt,
p2psearch.ppt]**

**Lecture & Lab: **Selected
Applications

*Knowledge
Discovery, **Content Filtering
& Search*

**Reading:
**D. Gibson, J. Kleinberg & P. Raghavan
(1998) Inferring Web communities from link topology.
Proc. 9th ACM Conference on Hypertext and Hypermedia.

Gary Flake, Steve Lawrence, C. Lee Giles & Frans Coetzee (2002) Self-Organization and Identification of Web Communities. IEEE Computer, 35:3. See http://webselforganization.com/

S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan & A. Tomkins, Hypersearching the Web. Scientific American, June 1999.

**Thanksgiving
Recess**

**Class 14: (11-29-2005)**

**Lecture & Lab: **Selected
Applications

*Social Network Analysis *

**Reading:
**Wasserman, S. & Pattison, P. (2003). Network
analysis. In Lewis-Beck, M., Bryman, A., and Liao, T.F. (eds.) Encyclopedia
of Social Science Research Methods, to appear. Oregon, OH: Sage Publications.

Mark E. J. Newman (2001) Who is the best connected scientist? A study of scientific coauthorship networks. Phys.Rev. E64 (2001) 016131; Phys.Rev. E64 (2001) 016132.

M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271-8276 (2002).

**Bioinformatics
**

**Optional Reading: **

Papers by Uri
Alon** **& Albert László Barabási.
**
** Lada A.
Adamic, Dennis Wilkinson, Bernardo A. Huberman, Eytan Adar.
A Literature Based Method for Identifying Gene-Disease
Connections. IEEE Computer Society Bioinformatics Conference (CSB'02), 2002,
pp. 109-118.

**Class 15: (12-06-2005)**

**Lecture: **Final Project
Presentations

This section of the course web page will frequently be updated. Please suggest links to include.

**Software:**

- R. A language and environment for statistical computing and graphics http://www.r-project.org/.
- PAJEK. A program for analyzing large networks, and arguably the best drawing program on the market. It is free and available at: http://vlado.fmf.uni-lj.si/pub/networks/pajek/.
- UCINET 6. Social Network Analysis Program http://www.analytictech.com/downloaduc6.htm.
- StarLogo. A programmable modeling environment for exploring the behaviors of decentralized systems such as bird flocks, traffic jams, and ant colonies. http://education.mit.edu/starlogo/
- MadKit http://www.madkit.org/ a Java multi-agent platform built upon an organizational model.
- NetVis Module - Dynamic Visualization of Social Networks by Jonathon N. Cummings
- X-means by Dan Pelleg is available in both binary and source code
- Visione - Analysis and visualization of social networks.
- More Computational Models and Social Network Tools

Internet topology: Autonomous system graphs

World Wide Web: See Albert-László Barabási's collection of network data.

Collaboration networks: Datasets of collaboration among scientists, movie actors, and business people.

Protein interaction networks (nodes resemble proteins in a cell, edges join protein pairs that have been found to interact): Yeast

Cellular data: See Albert-László Barabási's collection of network data.

Semantic networks - Free association datasets for words (nodes resemble words, edges connect words uttered by test subjects in free response to cue words): Edinburgh Associative Thesaurus, University of South Florida Free Association Norms, Princeton's WordNet.

**Additional Reading:**

**Excellent Review
articles:**

Mark E. J. Newman (2001a).
Scientific collaboration networks. I. Network construction and
fundamental results. Physical Review E, 64, 016131.

Mark E. J. Newman (2001b).
Scientific collaboration networks. II. Shortest paths, weighted
networks, and centrality. Physical Review E, 64, 016132.

*Graph Theory*

Harary, F. (1969) Graph
Theory. Addison-Wesley.

Knoke and Kuklinski
(1982) Network Analysis: Sage Publications.

*Data Mining*

David J. Hand, Heikki
Mannila, Padhraic Smyth (2001) Principles of Data Mining: MIT Press.
*(on reserve)*

Jiawei Han, Micheline
Kamber. Data Mining (2000) Concepts and Techniques: Morgan Kaufmann
Publishers.

Ian H. Witten &
Eibe Frank. Data Mining (1999) Practical
Machine Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann Publishers.

*Social Network Analysis*

Linton
C. Freeman (2004) The development of social network
analysis: a study in the sociology of science. Empirical Press.

A. Degenne and M. Forse (1999). Introducing Social Networks: London, Sage Publications.

Stanley Wasserman, Katherine Faust,
Dawn Iacobucci (1994). Social Network Analysis: Methods and Applications: Cambridge
Univ Press.

Wasserman and Faust (1994). Social
Network Analysis: Cambridge University Press.

Wasserman, S. & Galaskiewicz,
J. (1994). Advances in Social Network Analysis: Sage.

Everett M. Rogers (1995). The Diffusion
of Innovations: Free Press.

*Bibliometrics*

Braam, R. B. (1991).
Mapping of Science: Foci of Intellectual Interest in Scientific
Literature: DSWO Press, University of Leiden.

Craine, D. (1972).
Invisible Colleges: Diffusion of Knowledge in Scientific Communities:
The University of Chicago Press.

Donohue, J. C. (1973).
Understanding Scientific Literature: A Bibliometric Approach: MIT Press.

Garfield, E. (1979).
Citation Indexing - Its Theory and Application in Science, Technology
and Humanities: ISI Press, Philadelphia.

Irvine, J. & Martin,
B. (1989). Research Foresight: Pinter Publisher.

Valente, T.W. (1995).
Network Models of the Diffusion of Innovations: Hampton Press.

Egghe, L. & Rousseau,
R. (1990). Introduction to Informetrics. Elsevier.

*Dynamic Systems*

Buchanan, M. (2002).
Nexus: Small Worlds and the Groundbreaking Science of Networks: Norton
& Company.

Johnson, S. (2001).
Emergence: The Connected Lives of Ants, Brains, Cities, and Software:
Scribner.

Kennedy, J., Russell C.
Eberhart, R. C., Shi, Y. (2001) Butterfly Economics: A New General
Theory of Social and Economic Behavior: Basic Books.

Duncan J. & Watts, D.
J. (1999). Small Worlds: Princeton Univ Press.

Albert-László
Barabási (2002). Linked: The New Science of Networks: Perseus
Publishing.

*Information
Diffusion*

Valente, Tom W. (1995).
Network Models of the Diffusion of Innovations. Hampton Press.

Everett M. Rogers (1995).
The Diffusion of Innovations. Free Press.

**Online Resources:**

International Network for Social
Network Analysis

SocioSite
Networks, Groups and Social Interaction

**Computing Services
& Software:**

University Computing Services Help
Online

Searching is a trick you
can learn.

**Related Courses at IU:
**L597 Statistics
for Information Science and Usability by John C. Paolillo, IUB.

Data Mining and Analysis by F. Chris Allbright, Kelley School of Business, IUB.

P747 Advanced Model Fitting and Model Comparison: Computer techniques and theoretical issues by John K. Kruschke, Psychology, IUB.

P747/Q700: Bayesian and Data Mining by Richard M. Shiffrin, PSY, IUB.

Complex Adaptive Systems by Robert Goldstone, PSY, IUB.

I400/I590 Biologically Inspired Computing by Luis M. Rocha, SoI, IUB (Fall 2005).

Network Analysis by Stanley Wasserman for Sociology, IUB (Spring 2005).

Multivariate Methods by Stanley Wasserman for Psychology, IUB (Spring 2005).

I400/590 Topics in Informatics - Games and Gossip (3 cr) by Marco Janssen, Informatics, IUB (Spring 2005).

I400/I590 Artificial Life as an approach to Artificial Intelligence (3 cr) by Larry Yaeger, Informatics, IUB (Spring 2005).

B669: Managing & Mining Scientific Data, Topics in Database and Information Systems by Ed Robertson and Dirk Van Gucht, CS, IUB.

CSCI B659 Topics in Artificial Intelligence: Web Mining by Filippo Menczer, Informatics, IUB

P 700 Fractals and Pattern Formation (1 cr) by Y. Hayakawa, Physics, IUB.

INFO I512 Scientific Data Management and Analysis I590 High Performance Data Management and Processing in Informatics, IUB by Zdzisaw Meglicki, UITS.

T610 Networked Society by Harmeet Sawhney, Telecommunications, IUB.

T603 Communication Networks by J. Alison Bryant, Telecommunications, IUB. (Starts in Fall 23005)

CSCI590: Data Mining by Snehasis Mukhopadhyay, CS, IUPUI.

**Related Courses off Campus:**

Networks and Complexity
in Social Systems by D. J. Watts, Columbia University.

Models of Social Dynamics
by David Gibson, Harvard University.

Introduction
to Social Network Methods by Robert A. Hanneman, University of California.

Social Network Analysis
by Robert A. Hanneman, Department of Sociology, University of California.

Scaling, Power Laws, and the Small
World Phenomena in Networks by Don Towsley, University of Massachusetts

The Structure of Information
Networks by Jon Kleinberg, Cornell University

Biological Applications
of Physics by Paul Ginsparg, Cornell University

Complex
Systems by Mark Newman,
University of Michigan

Seminar on
Social Networks by James
Moody, The Ohio State University

Complex Human
Networks Reading Group

Study Group on Complex
Networks

**Workshops &
Conferences**

International Conference
on Computability and Complexity in Analysis to be held in Cincinnati, August
28-30, 2003.

http://ella.slis.indiana.edu/~katy/L597

Last modified: 09/16/2005