L597: Structural Data Mining and Modeling (Fall 2005)

Course Description | Grade | Course Outline | Resources

"The purpose of computing is insight – not numbers."
R. W. Hamming (1962)
Instructor: Katy Börner | Email: katy@indiana.edu | Office: Main Library 019 | Phone: 855-3256
Assistant Instructor: Weimao Ke | Email: wke@indiana.edu

Lecture:  Tue Tue 1p-2:15p & 2:30p-3:45p, LI 001
Office hours: Mon 3-4p, LI 019 (Katy), Fri 2-3p, LI 020 (Weimao)

Prerequisites: L401 or equivalent

Class Listserver: katy_L597-L@listserv.indiana.edu
Class Web Page: http://ella.slis.indiana.edu/~katy/L597-Fall05
Project Handin Webpage: http://ella.slis.indiana.edu/~katy/handin/L597-F05/cgi/handinlogin.cgi

Web Resources: MS Excel Primer, Pajek tutorial, Perl Scripts

Recorded Talks: Information Processing Talks 2002-2003, Sackler Colloquium on Mapping Knowledge Domains Talks, Digital Biology Talks, Complexity Digest 2000.31,

Talk Series on Networks and Complex Systems, Spring 05, Fall 04.

Course Description
This course introduces students to major methods, theories, and applications of structural data mining and modeling. It covers elementary graph theory and matrix algebra, data collection, structural data mining, data modeling, and applications.

Students will learn how to frame the research question, collect the data, run the analysis, and interpret the results. In addition, they will learn how to design and evaluate descriptive and predictive models of diverse complex networks to improve their understanding of the underlying principles.

Upon taking this course students will be able to analyze and describe real networks (power grids, WWW, social networks, etc.) as well as relevant phenomena such as disease propagation, search, organizational performance, social power, and the diffusion of innovations.

Course Format
A combination of lectures and 4-5 labs will be utilized to introduce material from such diverse domains as mathematics (graph theory, matrix algebra, chaos theory), computer science (data mining, knowledge discovery, structural pattern matching, complex network theory), information science (bibliometrics, webometrics, social network analysis), physics (nonlinear dynamics, statistical physics), biology (genomics, food webs), etc. MS Excel, Matlab and Pajek will be employed for data mining and modeling. Students will improve their collaborative skills by working in topic dependent teams.

Individual and group work will be evaluated according to how well the course material is understood and implemented into projects, as well as the quality of written and oral presentations. Students are expected to spent about 8h per week outside of class for readings and projects.

Students will be required to do weekly readings of journal articles and participate in class discussions. Class participation is evaluated on the frequency of relevant, constructive contributions that reflect a close reading of assigned materials and thoughtful reflection on the topic during class as well as via the class majordomo list and handin web pages.

The 15 min presentation will address a specific topic/question and will be based on readings from the literature or internet. Sources will be provided. If you can find more that's great. See Preparation of Presentations for more details.
You are expected to use the office hours the week before you will give the presentation to discuss your preparation with the instructor. Prepare your presentation as well as any specific questions you may have in advance.
All students will be expected to participate in class by reading the assigned material plus asking and answering questions. Reading materials are assigned for study in preparation for class discussion. Thus, class 2 material should be completed before attending the second class. Please do surf the listed web pages so that we can discuss them in class.

Diverse small projects are designed to create competence in using taught methods and the software effectively. A final research project (which may be done in groups and in collaboration with clients) is due at the end of the semester. Detailed descriptions of projects will be provided. Results have to be submitted via the handin web page.

The final grade breakdown will be as follows: class participation (20%), presentation of selected readings (10%), small projects (30%), and final project (40%). Grades are assigned according to the grading standards of SLIS.

1. Class attendance: Email the instructor if you can't make it to a class.
2. Plagiarism:Clearly indicate if you use materials from other sources. Academic and personal misconduct by students in this class are dealt with according to the Student Disciplinary Procedures.
3. Late Handin Policy: Late assignments or incomplete assignments are allowed only because of an unforeseen emergency that is preceded by diligent work, not for a pattern of weak performance. No individual student will be allowed to do extra work to raise the final grade or to make up missing work. All grades become final one week after the material is returned to you. No claims, however justifiable, will be considered after this deadline. If there is a medical or personal reason requiring you to miss an exam, you must present your excuse in advance and in writing, and we require some physical proof. Course work handed in
  •  Within the first 10 min past 10 pm will receive at most 90% of the possible points.
  •  Between 10.10 pm  to 11 pm receive at most 50% of the possible points.
  •  Past 11 pm receive Z.

  • Make sure you handin in time and your handin is accessible and readable!

    Credits: 3 for L597

    Course Outline
    The class schedule may change as the course progresses. Changes will be posted on the L597 majordomo list.


    Class 1 (08-30-2005) [dm-class1.ppt]
    Lecture: Course Description & Outline, Class Format, Grades, Resources
    Introduction to Structural Data Mining and Modeling.
    Structural Data Collection (the importance of high quality data) & Data Analysis Basics

    Project 1: Personal Web page that tells about you, your expertise, and your expectations on this course. It will be used to adjust course materials to the programming skills and (departmenal) background of registered students and to setup the handin web pages. In addition, you will be asked to phrase a scientific question that network science and/or complex systems approaches can help answer; select an appropriate dataset; and discuss how this question might be solved.
    Due 09-05-2005 at 8pm (~1 week).

    Class 2 (09-06-2005) [dm-class2.ppt, dm-lab1.ppt]
    Lecture: Introduction to Graph Theory
    Readings: Strogatz S. H. Exploring complex networks. Nature 410, 268-276 (2001).
    Hayes (2000) Graph theory in practice: Part I and Part II, American Scientist, Volume 88, No. 2, 9-13 and 104-109.

    Lab 1 in Shepherd Lab, Main Library 030: Data Analysis Basics - Introduction to the IVC Software Framework

    Project 2: Analyze a structural data set.
    Due 09-26-2005 at 8pm (~ 3 weeks).

    Class 3 (09-13-2005) [dm-class3.ppt, dm-lab2.ppt]
    Lecture: Matrix Algebra Primer
    Readings: Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins & Janet Wiener (2002) Graph structure in the web: Experiments and models. Computer Networks, 33, pp. 309-320.
    Lab 2 in Shepherd Lab, Main Library 030: Network Analysis Toolkit & Pajek.

    Structural Data Mining

    Class 4 (9-20-2005) [dm-class4.ppt, dm-lab3.ppt]
    Lecture: Data Mining vs. Structural Data Mining
    Kristin P. Bennett & Colin Campbell (2000) Support Vector Machines: Hype or Hallelujah? SIGKDD Explorations, 2, 2, pp. 1-13.

    J. Kleinberg (1998) Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. Extended version in Journal of the ACM 46 (1999). Also appeared as IBM Research Report RJ 10076, May 1997.
    Listening: Talks by Monika Henzinger & Steve Lawrence
    Presentation: Pedro Domingos, Matt Richardson. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Advances in Neural Information Processing Systems 14, 2002.
    F. Menczer (2002) Growing and Navigating the Small World Web by Local Content. Proc. Natl. Acad. Sci. USA 99(22): 14014-14019. [John Burgoon]

    Lab 3 in Shepherd Lab, Main Library 030: UCINET

    Class 5: (9-27-2005) [dm-class5.ppt, dm-lab4.ppt]
    Lecture: Random Networks vs. Small World Networks vs. Scale Free Networks, Power Laws in Graphs
    Reading: Watts D. J. and Strogatz S. H. Collective dynamics of 'small-world' networks. Nature 393, 440-442 (1998).
    Presentations: Reka Albert, Albert-Laszlo Barabasi (2002). Statistical mechanics of complex networks. cond-mat/0106096:
    Small World Networks [Todd Holloway]
    Scale Free Model [Muzaffer Ozakca]
    Lab 4 in Shepherd Lab, Main Library 030: Chizu

    Project 3: Analyze and visualize a scholarly data set.
    Due 10-17-2005 at 8pm (~ 3 weeks).

    10-03-2005 InfoVis Lab Open House, 4-6p, Lilly Library, IUB.

    Class 6: (10-04-2005) [dm-class6.ppt]
    Lecture: Structural Similarity Measures
    Reading: Goldsmith, T. E. & Davenport, D. M. (1990). Assessing structural similarity of graphs. In R. W. Schvaneveldt (Ed.), Pathfinder associative networks: Studies in knowledge organization (pp. 75-87), Norwood, NJ: Ablex.  Surf: References for Graph Similarity
    Presentation: Krumhansl (1978) Concerning the applicability of geometric models to similarity data: The interrelationship between similarity and spatial density. Psychological Review, 85, 445-463.
    Pruzansky, Tversky & Carroll (1982) Spatial versus tree representations of proximity data. Psychometrika, 47, 3-24. [Vsevolod Kapatsinski]
    Lab 5 in Shepherd Lab, Main Library 030: VxOrd and VxInsight and other tools compiled by John Burgoon.

    Class 7: (10-11-2005) [dm-class7.ppt]
    Lecture: Dimensionality Reduction Techniques & Clustering Techniques
    Readings: R. N. Shepard. Multidimensional scaling, tree fitting, and clustering. Science, 210:390-398 (1980).
    Presentation: C.T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, 20:68-86, 1971. [Gavin LaRowe]
    Lab in 001:
    Hands-on session: Coupling network analysis and visualization.

    Descriptive and Predictive Data Modeling

    Class 8: (10-18-2005) [dm-class8.ppt]
    Descriptive Modeling Techniques vs. Predictive/Generative Modeling Techniques
    Readings: Dorogovtsev, S. N. & Mendes, J. F. F. (2002). Evolution of networks. cond-mat/0106144
    Reka Albert, Albert-Laszlo Barabasi (2002). Statistical mechanics of complex networks. cond-mat/0106096: Evolving Networks

    Final Project 4:  Data analysis and modeling. The project can (1) test hypotheses derived from grand theory, (2) investigate the relationships among a set of variables, and tell a story (i.e., construct a theory) based on the results, or (3) use structural data mining and modeling to diagnose a problem and prescribe a solution based on the diagnosis.
    Due 11-28-2005 at 8pm.

    Class 9: (10-25-2005) [dm-class9.ppt]
    Guest Lecture by Larry Yaeger
    Lecture: Complex Adaptive Systems.
    Reading: Ball, P. (1999). The self-made tapestry. Oxford, England: Oxford University Press, chapter on Communities.
    Presentation: Robert Axelrod (1998) Disseminating Culture - The Dissemination of Culture: A Model with Local Convergence and Global Polarization. In The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, chapter 7.
    Robert Axelrod (1997) Replication of Agent Based Models. In The Complexity of Cooperation. Princeton University Press. [Adity Upoma]

    Network Dynamics

    Class 10: (11-01-2005) [dm-class10.ppt]
    Lecture: Predictive/Generative Modeling Techniques. Network Structure vs. Global Behavior.
    Reading: Katy Börner, Jeegar Maru, and Robert Goldstone (2004). The Simultaneous Evolution of Author and Paper Networks. PNAS, 101(Suppl_1) 5266-5273.

    Presentation: Presentation: M. Granovetter. The strength of weak ties. American Journal of Sociology, 78(6), 1360-1380 (1973). & Mark Granovetter (1983) The strength of weak ties. A Network Theory Revisited. Sociological Theory. 1, 201-233. [Josepth Cottam]
    Lab: Present your Final Project goals together with first results.

    Class 11: (11-8-2005)
    Lecture:  Modeling Diffusion in Evolving Network Ecologies. [dm-class11.ppt]
    Reading: Peter Sheridan Dodds, Duncan J. Watts, and Charles F. Sabel (2003). Information exchange and the robustness of organizational networks. PNAS, vol. 100, 12516-12521.
    Presentation 1: Torsten Hagerstrand. A Monte Carlo Approach to Diffusion. In Spatial Analysis - A Reader in Statistical Geography. Brian J.L. Berry & Duane F. Marble (eds.), Prentice Hall, pp.368-384. [Justin Donaldson]
    Presentation 2: Romualdo Pastor-Satorras and Alessandro Vespignani (2003). Epidemics and Immunization in Scale Free Networks. In Handbook of Graphs and Networks: From the Genome to the Internet. Stefan Bornholdt and Hans Georg Schuster (Eds.), Wiley-VCH, pp. 111-130. [Bryan Hasty]

    Selected Applications

    Class 12: (11-15-2005) [dm-class12.ppt, p2psearch.ppt]
    Lecture & Lab: Selected Applications
    Knowledge Discovery, Content Filtering & Search
    D. Gibson, J. Kleinberg & P. Raghavan (1998) Inferring Web communities from link topology. Proc. 9th ACM Conference on Hypertext and Hypermedia.

    Gary Flake, Steve Lawrence, C. Lee Giles & Frans Coetzee (2002) Self-Organization and Identification of Web Communities. IEEE Computer, 35:3. See http://webselforganization.com/
    S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan & A. Tomkins, Hypersearching the Web. Scientific American, June 1999.
    Search on Small World Networks: J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776, October 1999.

    Search on Scale-Free Networks: Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani & Bernardo A. Huberman (2001) Search in Power-Law Networks. Phys. Rev. E, 64 46135.
    Optional Reading: Fletcher, George , Sheth, Hardik and Börner, Katy. (2004). Unstructured Peer-to-Peer Networks: Topological Properties and Search Performance. Third International Joint Conference on Autonomous Agents and Multi-Agent Systems. W6: Agents and Peer-to-Peer Computing, Moro, Gianluca, Bergmanschi, Sonia and Aberer, Karl, Eds., New York, July 19-23, pp. 2-13.

    Thanksgiving Recess

    Class 14: (11-29-2005) 
    Lecture & Lab: Selected Applications
    Social Network Analysis

    Wasserman, S. & Pattison, P. (2003). Network analysis. In Lewis-Beck, M., Bryman, A., and Liao, T.F. (eds.) Encyclopedia of Social Science Research Methods, to appear. Oregon, OH: Sage Publications.
    Mark E. J. Newman (2001) Who is the best connected scientist? A study of scientific coauthorship networks. Phys.Rev. E64 (2001) 016131; Phys.Rev. E64 (2001) 016132.

    Presentation 1: Duncan J. Watts, Peter Sheridan Dodds & M. E. J. Newman. Identity and Search in Social Networks. Science 296, 1302-1305 (2002).

    M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271-8276 (2002). [Jonathan Klinginsmith]
    Presentation 2: Valente Tom W. (1996) Social network thresholds in the diffusion of innovations. Social Networks, 18, 69-89 (1996). [Aaron Curtis]

    Guest Lecture by Luis Rocha on "Bibliome Informatics for Protein Function Prediction"
    Reading: Maguitman, A. G., Rechtsteiner, A., Verspoor, K., Strauss, C.E., Rocha, L.M. Large-Scale Testing Of Bibliome Informatics Using Pfam Protein Families. In: Pacific Symposium on Bioinformatics 2006. In Press.

    Optional Reading:
    Papers by Uri Alon & Albert László Barabási.
    Lada A.  Adamic, Dennis  Wilkinson, Bernardo A.  Huberman, Eytan  Adar. A Literature Based Method for Identifying Gene-Disease Connections. IEEE Computer Society Bioinformatics Conference (CSB'02), 2002, pp. 109-118.

    Class 15: (12-06-2005)
    Lecture:  Final Project Presentations

    This section of the course web page will frequently be updated. Please suggest links to include.


    Network Data Sets:
    Internet topology: Autonomous system graphs
    World Wide Web: See Albert-László Barabási's collection of network data.
    Collaboration networks: Datasets of collaboration among scientists, movie actors, and business people.
    Protein interaction networks (nodes resemble proteins in a cell, edges join protein pairs that have been found to interact): Yeast
    Cellular data: See Albert-László Barabási's collection of network data.
    Semantic networks - Free association datasets for words (nodes resemble words, edges connect words uttered by test subjects in free response to cue words): Edinburgh Associative Thesaurus, University of South Florida Free Association Norms, Princeton's WordNet.

    Additional Reading:
    Excellent Review articles:
    Mark E. J. Newman (2001a). Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 64, 016131.
    Mark E. J. Newman (2001b). Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 64, 016132.

    Graph Theory
    Harary, F. (1969) Graph Theory. Addison-Wesley.
    Knoke and Kuklinski (1982) Network Analysis: Sage Publications.

    Data Mining
    David J. Hand, Heikki Mannila, Padhraic Smyth (2001) Principles of Data Mining: MIT Press. (on reserve)
    Jiawei Han, Micheline Kamber. Data Mining (2000) Concepts and Techniques: Morgan Kaufmann Publishers.
    Ian H. Witten & Eibe Frank. Data Mining (1999) Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.

    Social Network Analysis
    Linton C. Freeman (2004) The development of social network analysis: a study in the sociology of science. Empirical Press.
    A. Degenne and M. Forse (1999). Introducing Social Networks: London, Sage Publications.

    Stanley Wasserman, Katherine Faust, Dawn Iacobucci (1994). Social Network Analysis: Methods and Applications: Cambridge Univ Press.
    Wasserman and Faust (1994). Social Network Analysis: Cambridge University Press.
    Wasserman, S. & Galaskiewicz, J. (1994). Advances in Social Network Analysis: Sage.
    Everett M. Rogers (1995). The Diffusion of Innovations: Free Press.

    Braam, R. B. (1991). Mapping of Science: Foci of Intellectual Interest in Scientific Literature: DSWO Press, University of Leiden.
    Craine, D. (1972). Invisible Colleges: Diffusion of Knowledge in Scientific Communities: The University of Chicago Press.
    Donohue, J. C. (1973). Understanding Scientific Literature: A Bibliometric Approach: MIT Press.
    Garfield, E. (1979). Citation Indexing - Its Theory and Application in Science, Technology and Humanities: ISI Press, Philadelphia.
    Irvine, J. & Martin, B. (1989). Research Foresight: Pinter Publisher.
    Valente, T.W. (1995). Network Models of the Diffusion of Innovations: Hampton Press.
    Egghe, L. & Rousseau, R. (1990). Introduction to Informetrics. Elsevier.

    Dynamic Systems
    Buchanan, M. (2002). Nexus: Small Worlds and the Groundbreaking Science of Networks: Norton & Company.
    Johnson, S. (2001). Emergence: The Connected Lives of Ants, Brains, Cities, and Software: Scribner.
    Kennedy, J., Russell C. Eberhart, R. C., Shi, Y. (2001) Butterfly Economics: A New General Theory of Social and Economic Behavior: Basic Books.
    Duncan J. & Watts, D. J. (1999). Small Worlds: Princeton Univ Press.
    Albert-László Barabási (2002). Linked: The New Science of Networks: Perseus Publishing.

    Information Diffusion
    Valente, Tom W. (1995). Network Models of the Diffusion of Innovations. Hampton Press.
    Everett M. Rogers (1995). The Diffusion of Innovations. Free Press.

    Online Resources:
    International Network for Social Network Analysis
    SocioSite Networks, Groups and Social Interaction

    Computing Services & Software:
    University Computing Services Help Online
    Searching is a trick you can learn.

    Related Courses at IU: 
    L597 Statistics for Information Science and Usability by John C. Paolillo, IUB.
    Data Mining and Analysis
    by F. Chris Allbright, Kelley School of Business, IUB.

    P747 Advanced Model Fitting and Model Comparison: Computer techniques and theoretical issues by John K. Kruschke, Psychology, IUB.

    P747/Q700: Bayesian and Data Mining by Richard M. Shiffrin, PSY, IUB.
    Complex Adaptive Systems
    by Robert Goldstone, PSY, IUB.

    I400/I590 Biologically Inspired Computing by Luis M. Rocha, SoI, IUB (Fall 2005).
    Network Analysis by Stanley Wasserman for Sociology, IUB (Spring 2005)
    Multivariate Methods by Stanley Wasserman for Psychology, IUB (Spring 2005).
    I400/590 Topics in Informatics - Games and Gossip (3 cr) by Marco Janssen, Informatics, IUB (Spring 2005).
    I400/I590 Artificial Life as an approach to Artificial Intelligence (3 cr) by Larry Yaeger, Informatics, IUB (Spring 2005).
    B669: Managing & Mining Scientific Data, Topics in Database and Information Systems by Ed Robertson and Dirk Van Gucht, CS, IUB.
    CSCI B659 Topics in Artificial Intelligence: Web Mining by Filippo Menczer, Informatics, IUB
    P 700 Fractals and Pattern Formation (1 cr) by Y. Hayakawa, Physics, IUB.
    INFO I512 Scientific Data Management and Analysis
    I590 High Performance Data Management and Processing in Informatics, IUB by Zdzisaw Meglicki, UITS.
    T610 Networked Society by Harmeet Sawhney, Telecommunications, IUB.
    T603 Communication Networks by J. Alison Bryant, Telecommunications, IUB. (Starts in Fall 23005)
    CSCI590: Data Mining by Snehasis Mukhopadhyay, CS, IUPUI.

    Related Courses off Campus:
    Networks and Complexity in Social Systems by D. J. Watts, Columbia University.
    Models of Social Dynamics by David Gibson, Harvard University.
    Introduction to Social Network Methods by Robert A. Hanneman, University of California.
    Social Network Analysis by Robert A. Hanneman, Department of Sociology, University of California.
    Scaling, Power Laws, and the Small World Phenomena in Networks by Don Towsley, University of Massachusetts
    The Structure of Information Networks by Jon Kleinberg, Cornell University
    Biological Applications of Physics by Paul Ginsparg, Cornell University
    Complex Systems by Mark Newman, University of Michigan
    Seminar on Social Networks by James Moody, The Ohio State University

    Complex Human Networks  Reading Group
    Study Group on Complex Networks

    Workshops & Conferences
    International Conference on Computability and Complexity in Analysis to be held in Cincinnati, August 28-30, 2003.

    Home Faculty Katy Börner L597

    Last modified: 09/16/2005