Binder, Jeffrey M. “Alien Reading: Text Mining, Language Standardization, and the Humanities.”

  • When applied to the study of literary and cultural texts, statistical text-mining methods tend to reinforce conceptions of language and meaning that are, at best, overly dependent on the “literal” definitions of words and, at worst, complicit in the marginalization of nonstandard linguistic conventions and modes of expression. A shift toward a media studies perspective could enable scholars to engage with these linguistic technologies in a way that keeps their alienness in sight, foregrounding their biases and blind spots and emphasizing the historical contingency of the ways in which computers “read” texts. Thinking of text-mining programs as objects of cultural criticism could open up an interchange between digital scholarship and the critical study of computers that is productive in both directions. 
  • Models of Language: “Topic modeling” does not refer to any single method, but rather to a number of distinct technologies that attempt to determine the “topics” of texts automatically. The implementation most commonly used in the humanities is a program called MALLET, which presents topics as lists of words, starting with the word most strongly affiliated with the topic and proceeding downward. The program associates each text with one or more of these topics, which constitute a guess as to what that text is “about.” There is no certainty to this process; the topics are produced by an approximate method, so the results are slightly different every time the program is run. The meaning of the results is further complicated by the fact that the “topics” in the output do not necessarily correspond to anything for which a simple description might exist. This program is descended from Latent Dirichlet Allocation (LDA), which in turn is descended from Latent Semantic Indexing (LSI). All of these were developed and tested on news and journal articles, which share a number of commonalities: standard dialect and orthography, a privileging of the informational over the aesthetic dimensions of language, and composition primarily in prose. Many of the examples used in testing these methods are also the sorts of text that the military-industrial apparatus would have a clear interest in mining. These programs therefore work best on hegemonic forms of language, and they structure their results in ways that encourage interpretation in terms of the standardized meanings of words.
  • Alien Reading: It’s virtually impossible for scholars to avoid text-mining software altogether, even if only indirectly through platforms like Google or JSTOR. The interfaces of our computers make their operations seem much simpler and more familiar than they really are, encouraging a passive, consumer-like orientation toward the computer rather than a deep understanding of it. So although we are familiar with terms these programs use, like “topic” or “word,” the structures and assumptions about language that underlie them differ fundamentally from how we understand language. We should approach these tools with a healthy degree of skepticism, treating text mining as an alien form of reading and at least making an effort to appreciate the extent to which the tools we use are unknown to us.
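Binder’s point that the topics come from an “approximate method” whose results shift between runs can be made concrete. MALLET infers its topics with Gibbs sampling, a randomized procedure; the toy sampler below is a minimal, standard-library-only sketch of that idea, not MALLET’s actual implementation, and the corpus, parameter values, and function name are invented for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of word tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    # Count tables: document-topic counts, topic-word counts, topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * V for _ in range(n_topics)]
    nk = [0] * n_topics
    # Randomly assign an initial topic to every word occurrence.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zs)
    # Repeatedly resample each word's topic from its conditional distribution.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], widx[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta) /
                           (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # Report each topic as its most strongly affiliated words, the way
    # MALLET presents its output.
    return [[vocab[i] for i in sorted(range(V), key=lambda i: -nkw[t][i])[:3]]
            for t in range(n_topics)]
```

Running this on the same corpus with different seeds typically produces differently ordered, and sometimes differently composed, word lists: the run-to-run variation Binder describes.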

Nelson, Robert K. “Mining the Dispatch.”

  • “Mining the Dispatch” uses topic modeling to explore—and encourage exploration of—the dramatic and often traumatic changes, as well as the sometimes surprising continuities, in the social and political life of Civil War Richmond. Its evidence is nearly the full run of the Richmond Daily Dispatch from the eve of Lincoln’s election in November 1860 to the evacuation of the city in April 1865. On the site, you can generate graphs and charts of various topics over this period, as well as view specific examples.

Sandweiss, Martha A. “Artifacts as Pixels, Pixels as Artifacts: Working with Photographs in the Digital Age.” November 1, 2013.

  • Historians tend to think of photographs as disembodied images or illustrations rather than as time-bound physical artifacts: material objects that reward study in and of themselves, primary sources that can be as useful as literary ones. The turn to digitization has in many ways made research on photographs easier; you can search large archives from home, zoom in easily, and compare items from different collections. But it also brings its own complications. Are you looking at a negative or a vintage print? How big is it, and who took it? What’s written on the back? Was everything in the collection digitized, or only certain items? Who decided what to digitize, and why?
  • Born-digital images raise even more problems. Beyond their sheer overwhelming number, which itself makes it hard to distinguish the typical from the unusual or to locate the metadata that identifies an image and lets us assess it as a historical document, there is the question of integrity. Photographers have always been able to stage their scenes and edit their images in some ways, but such alterations were generally limited and visible in the negatives. Now photos can be dramatically altered with ease, or even created entirely on a computer. And how can we evaluate the motivations of the photographers whose images we encounter online? How do we determine which images were made to promote a particular point of view, which to hurt or deceive? How can we even know who made the photograph? What kind of historical source is a pixelated image whose creator we can’t identify, whose integrity we can’t confirm, and whose purpose we can’t discern? In short, whether digitized or born-digital photographs can ever provide evidence as compelling and verifiable as physical images is very much up in the air.

Schmidt, Benjamin M. “Words Alone: Dismantling Topic Models in the Humanities.”

  • Topic modeling can give us an overview of the contents of large textual collections; it can provide some intriguing new artifacts to study; and it even holds some promise for structuring non-lexical data like geographic points. But simplifying topic models for humanists who will not (and should not) study the underlying algorithms creates an enormous potential for groundless — or even misleading — “insights.” Humanists tend to make two assumptions about a topic produced by topic modeling: first, that it is coherent, and second, that it is stable. Combined, these lead to a third: that the topic is meaningful. But topics fail these assumptions. Schmidt suggests two ways to bring words back to topic models in humanistic practice, to counteract some of the assumptions that make topic models so appealing. The first, using geographical data, shows the problems with labeling topics based on their top five to ten words, and the ways in which the assumptions of meaningfulness and coherence are not grounded. The second shows the dangers of accepting a topic model’s assumption of topic stability across different sorts of documents. Extremely common practices, such as plotting topic frequencies across time, can elide dramatic differences in which words different documents actually use. In both cases, visualization that uses the individual word assignments, not just the topic labels, can dramatically change the readings that humanists give to topics. Excitement about the use of topic models for discovery needs to be tempered with skepticism about how often the unexpected juxtapositions LDA creates will be helpful, and how often merely surprising. Topic models are no less ambiguous, no less fickle, and no less arbitrary than words. They require major feats of interpretation; even understanding the output of one particular model requires considerable effort.
Whatever uses humanists find for topic models, in the end they must integrate the models with a close understanding of the constituent words; and only by returning to describe changes in words can they create meaning.
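Schmidt’s warning about plotting topic frequencies across time can be illustrated with a toy example. The word-level assignments below are invented for illustration (the words, periods, and topic number are hypothetical, not drawn from any real model run): a topic’s share of the corpus stays flat across two periods even though the words assigned to it have changed entirely, which is exactly what a frequency plot built from topic labels alone would hide.

```python
from collections import Counter

# Hypothetical (word, topic) assignments for two time periods.
period_1860 = [("cotton", 3), ("auction", 3), ("slave", 3),
               ("market", 1), ("price", 1)]
period_1864 = [("shell", 3), ("battery", 3), ("regiment", 3),
               ("market", 1), ("price", 1)]

def topic_share(assignments, topic):
    """Fraction of word tokens assigned to a given topic."""
    return sum(1 for _, t in assignments if t == topic) / len(assignments)

def topic_words(assignments, topic):
    """Counts of the actual words assigned to a given topic."""
    return Counter(w for w, t in assignments if t == topic)

# A topic-frequency plot would look flat across the two periods...
print(topic_share(period_1860, 3), topic_share(period_1864, 3))  # 0.6 0.6
# ...while the words behind the "same" topic have completely changed.
print(sorted(topic_words(period_1860, 3)))  # ['auction', 'cotton', 'slave']
print(sorted(topic_words(period_1864, 3)))  # ['battery', 'regiment', 'shell']
```

This is the kind of word-level check Schmidt argues for: return to the individual word assignments before reading meaning into a topic’s trajectory.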