

Questions about topic modeling

Page history last edited by Alan Liu 9 years, 10 months ago

Questions about topic modeling harvested from

discussion in a graduate class on digital humanities

(October 29, 2013)

(English 236, U. California, Santa Barbara.  Instructor: Alan Liu)


As part of our class on topic modeling in English 236, we kept a running log of questions that the students and instructor (Alan Liu) asked and left hanging in our discussion.  The instructor edited, filled in the context for, and sometimes elaborated on these questions here.  Most of us are humanities or education scholars from several disciplines who are beginners at many of the topics included in this course.  These questions remained after our readings and practicums--and, in fact, our questions were often sparked (or stoked higher) by those exercises.  We're aware that there is additional research and discussion on some of these issues, likely at a level beyond that of beginners or occurring in other fields.  Suggestions for reading are welcome!

(Thanks to Ashley Champagne (@ashleymchamp), a student in the course, and William Warner, my colleague sitting in on the course, for acting as transcribers of our questions.)  --Alan Liu (@alanyliu)



  1. What difference does it make that we want to use topic modeling for humanities interpretation as opposed to the kinds of information retrieval it was (in part) originally designed for?  What do we mean by "discovery" in each context?
  2. What do we know theoretically or empirically about the "handoff" phase of topic modeling when the results of machine processing are handed off to humans for mental processing?  For example, is there cognitive-science research or design research on that?
  3. What consensus is there among digital humanists about the degree to which, and methods for, pre-filtering and pre-processing input material in order to get sensible output from topic modeling?
  4. What useful or interesting takeaways are there for humanists in the fact that multiple runs of topic modeling on the same material, with the same parameters, can produce slightly different results?  For a discipline like the humanities without any native understanding of statistics, is that just going to be irremediably unsettling?
  5. How do we know when it is appropriate to use topic modeling to analyze single works versus a medium-sized or large corpus of works?
  6. To what extent do we need to have a pre-existing model of meaningful structure in documents (i.e., of the kinds of meaning we expect to emerge from varying levels of structure such as words, phrases, sentences, paragraphs, chapters, sections of poems, genres, etc.) before we can do useful topic modeling?  Do the pre-processing activities of creating stop-lists and "chunking" that represent that model of meaningful structure merely sneak our preconceptions into the process, thus mooting the possibility of any really radical discovery?
  7. The topic model that results from LDA processing might be conceived as a 3D crystal with multiple facets (e.g., the topic view of the crystal, the document view, the word view).  How does choosing a particular point of view affect the interpretive "handoff" of the model to the human interpreter?
  8. A question sparked by a student's practicum of running Freud's Interpretation of Dreams through LDA topic modeling: what is the relation between psychoanalytic and topic-modeling "analysis"?  What does "latent" mean in each case?  What similarities of structure or concept are there between a "topic" and a "dream"? (On a less serious note: if Freud had used LDA on his patients, would all topics ultimately have reduced to a single topic with such words as "Father . . . Mother . . . [and the names for a present or absent body part]"?)
  9. To what degree do uncontrolled contingencies affect topic modeling?  Uncontrolled contingencies include: lack of a complete or representative corpus due to any number of reasons; lack of human energy and time to fine-tune the pre-processing of material or to do enough iterative re-modeling; etc.
  10. Probability phenomena (such as the "probability distributions" represented by topic models) scare humanists.  Why?  What would a humanities that is fully accepting of probability phenomena, including the entropic understanding of information, look like?  For example (citing an example of "motif" in Boris Tomashevsky's "Thematics" essay that we considered in class): can we imagine a humanities for which the sentence "Raskolnikov kills the old woman" would be just as satisfying if reformulated, "There is a 70% chance that Raskolnikov [or, put another way, 70% of Raskolnikov] killed (82%) or possibly just maimed (15%) the old (78%) or young (14%) woman (63%), man (21%), or cat (5%)"?  Not Schrödinger's cat, in other words, but, in this case, Schrödinger's "old woman"?
  11. Given the Franco-Prussian background of Johann Peter Gustav Lejeune Dirichlet, is "Dirichlet" pronounced with a bias toward the German or French influence?  Or are we fated forever to refer only to the acronym "LDA" in oral forums where topic modeling is discussed?  (Thanks to @UnaMcIvenna and @Kimberly Garmoe on Twitter, and also Wikipedia, for consultation on this last important issue!)
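
For readers who want to see concretely why repeated runs can differ (question 4), here is a minimal sketch, not the tool we used in class.  It assumes a toy collapsed Gibbs sampler for LDA written from scratch; the function names (gibbs_lda, top_words), the tiny corpus DOCS, and the parameter values are all hypothetical and chosen only for illustration.  The only source of randomness is the seed, so changing the seed is a stand-in for "running the model again."

```python
import random

# Hypothetical toy corpus: each document is a list of word tokens.
DOCS = [
    "dream wish father mother dream sleep".split(),
    "whale ship sea captain whale harpoon".split(),
    "dream sleep wish night mother".split(),
    "ship sea storm captain sail".split(),
]

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    # Start from a random topic assignment for every token (seed-dependent).
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how well it fits the document and the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic at random in proportion to those weights.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def top_words(topic_word, k, n=3):
    """Return the n most frequent words assigned to topic k (ties broken alphabetically)."""
    ranked = sorted(topic_word[k].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]

if __name__ == "__main__":
    # Same corpus, same parameters, different seeds: compare the topics by eye.
    for seed in (1, 2, 3):
        model = gibbs_lda(DOCS, n_topics=2, seed=seed)
        print(seed, [top_words(model, k) for k in range(2)])
```

With a fixed seed the sketch is repeatable; with different seeds the topics may come out in a different order or with somewhat different top words, even though nothing about the corpus or the parameters has changed.  Whether that variation is interpretively significant is exactly what question 4 leaves open.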


