It was a great joy to return to the University of Amsterdam and give this talk to my old friend Richard Rogers and his 100+ attentive workshop attendees.
We interviewed researchers at the University of Illinois Chicago in the Health Media Collaboratory about their use of DiscoverText and the Gnip-enabled Power Track for Twitter to study smoking behavior. The team, led by Dr. Sherry Emery, explains why it is important to train and use custom machine classifiers to sort the millions of tweets they are collecting from the full Twitter fire hose. The UIC team strongly argues for the combination of good tools and highly reliable data.
Just in time for the 2012 GOP convention, we are running a special offer to provide full Twitter fire hose access via the Gnip-enabled Power Track for Twitter:
Never miss a tweet. Full coverage with no rate limits. Powerful search rules, text analytics, clustering and machine-learning via custom machine classifiers.
Check out this video introducing our latest experiments creating custom social “sifters” to winnow down a lake of social media data and leave behind only those items that are truly responsive to your search. Great tool, or greatest tool ever? You be the judge.
DiscoverText is rolling out an addition to its analytical toolkit: random sampling. The Web service already offers an array of tools for text analytics and rigorous, team-based qualitative data analysis. These functions include the ability to code and annotate text, measure inter-rater reliability, adjudicate coder validity, attach memos to text, cluster duplicate and near-duplicate documents, share documents, and classify text using an active-learning Naive Bayesian classifier. While still in beta, random sampling is a key new addition.
After DiscoverText users amass extraordinary amounts of social media data (for example via the Public Twitter API, the GNIP Powertrack, or the Facebook Social Graph), they can now more easily extract a random sample for analysis. The user decides the size of the sample, which accommodates iteration, experimentation, and other scientific methods. The option is built into the dataset creation process: on the new dataset creation page, you see a sample size prompt.
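For readers who want to see the idea in code, here is a minimal sketch of drawing a simple random sample of a user-chosen size from a collected dataset. It is an illustration only, not DiscoverText's implementation: the function name `draw_random_sample`, the `seed` parameter, and the toy `tweets` list are all assumptions made for this example.

```python
import random


def draw_random_sample(items, sample_size, seed=None):
    """Return a simple random sample (without replacement) of `items`.

    `sample_size` plays the role of the sample size prompt; `seed` is a
    hypothetical extra that makes a draw repeatable.
    """
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    rng = random.Random(seed)
    # Cap the request at the dataset size so small datasets do not raise.
    k = min(sample_size, len(items))
    return rng.sample(items, k)


# Toy stand-in for a large archive of collected tweets.
tweets = [f"tweet {i}" for i in range(10_000)]
sample = draw_random_sample(tweets, sample_size=500, seed=42)
print(len(sample))
```

Fixing the seed is a design choice worth considering for research use: it lets a colleague redraw exactly the same sample when replicating an analysis.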
This additional method for data prep and analysis augments current information retrieval techniques, such as search with advanced filtering. It also builds up our framework for expanding available NLP methods from straightforward Bayesian classification, which aims to analyze substantial quantities of data in their original bulk-form, to a menu of computationally intensive methods that can iterate more quickly and effectively against random data samples. For example, the LDA topic model tool we are releasing will be faster and more effective against smaller random samples.
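To make "straightforward Bayesian classification" concrete, here is a toy multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a sketch under stated assumptions, not the DiscoverText classifier: the class name `TinyNaiveBayes`, the whitespace tokenizer, and the four training examples are invented for illustration, and the active-learning loop is left out.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-coded examples, standing in for human coding decisions.
clf = TinyNaiveBayes()
clf.train("trying to quit smoking this week", "relevant")
clf.train("smoking a cigarette on my break", "relevant")
clf.train("smoking hot new album out today", "irrelevant")
clf.train("the grill is smoking ribs tonight", "irrelevant")

print(clf.classify("quit smoking cigarette"))
print(clf.classify("new album out today"))
```

A classifier like this needs only word counts, which is why it can run against data in bulk; heavier methods benefit more from working on a smaller random sample.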
This new feature provides both an additional analytical approach and the opportunity to easily compare results between competing (or complementary) analytic methods. We look forward to experimenting with this new tool and hearing about how random sampling will enhance the research of our current and future users.
Special Note to DT Users: We need to turn this feature on one account at a time while we are testing it. Drop us a line if you want to try the tool.
We’ll keep you posted on the launch as more dataset modifications are pushed live. As always, if you have any questions, feel free to email us anytime at firstname.lastname@example.org. Your feedback is crucial. Sign up and try it out for yourself at discovertext.thrivehivesite.com.
Researchers interested in large text collections and their itinerant coders tend to muddle through with limited collaborative, cross-disciplinary resources upon which to draw. The generic criteria for high-quality codebook construction and effective coding are underdeveloped, even as the tools and techniques for measuring the limits of manual or machine coding grow ever more sophisticated. In that paradox there may be the seed of a partial solution to some of these issues. The ability to quickly and easily pre-test coding schemes and produce on-the-fly displays of coding inconsistencies is one way to more uniformly train coders to perform reliably (hence usefully) while ensuring a satisfactory level of valid observations. By the same token, the ability to permit an unlimited number of users to review or replicate all the coding and adjudication steps using a free, web-based platform would be a large and bold step onto our methodological and metaphorical bridge.
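As one hedged illustration of what an on-the-fly display of coding inconsistencies could draw on, the sketch below computes Cohen's kappa for two coders and lists the items on which they disagree. Cohen's kappa is a common agreement statistic chosen here for the example; the text does not say which metrics any particular tool reports, and the function names and sample codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same items in order."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(codes_a)
    # Share of items on which the two coders chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


def disagreements(codes_a, codes_b):
    """Return (item index, coder A's code, coder B's code) for each mismatch."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(codes_a, codes_b))
            if a != b]


# Hypothetical pre-test: two coders, six items, three codes.
coder_a = ["pro", "con", "pro", "neutral", "con", "pro"]
coder_b = ["pro", "con", "con", "neutral", "con", "pro"]

print(round(cohens_kappa(coder_a, coder_b), 3))
print(disagreements(coder_a, coder_b))
```

Listing the mismatched items alongside the summary statistic matters for training: the number says how reliable the coders are, while the list shows which rules in the codebook need sharpening.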
What are needed are more universal annotation metrics, a standard lexicon, and widely shared, semi-automated coding tools that make the work of humans more useful, fungible, and durable. Ideally, these tools would be interoperable, or combined in a single system. The new system would allow human coders to create annotations and allow other experts to efficiently examine, influence, and validate their work. At a deeper level, this calls for much better and more transparently codified approaches to training and deploying coders—an annotation science subfield—so that a more coherent and collaborative research community can form around this promising methodological domain.
Investigators in the social sciences use reliably coded texts to reach inferences about diverse phenomena. Many forms of public-sphere discourse and governmental records are readily amenable to coding; these include press content, policy documents, speeches, international treaties, and public comments submitted to government decision-makers, among many others.
Systematic analysis of large quantities of these sorts of texts represents an appealing new avenue for both theory building and hypothesis testing. It also represents a bridge across the divide between qualitative and quantitative methodologies in the social sciences. These large text datasets are ripe for mixed-methods work that can provide a rich, data-driven approach both to the macro and micro view of large-scale political phenomena.
Traditionally, social scientists working with text use a variety of qualitative research methods for in-depth case studies. For many legitimate and pragmatic reasons, these studies generally consist of a small number of cases or even just a single case. As Steven Rothman and Ron Mitchell note, the reliability of data drawn from qualitative research comes under greater scrutiny, as increased dataset complexity requires increased interpretation and, subsequently, leads to increased opportunity for error. The case study method is plagued by concerns about limitations on its external validity and the ability to reach generalized inferences. With the proliferation of easily available, large-scale digitized text datasets, an array of new opportunities exists for large-n studies of text-based political phenomena that can yield both qualitative and quantitative findings.
More to the point, high-quality manual annotation opens up the possibility for cross-disciplinary studies featuring collaboration between social and computational scientists. This second opportunity exists because researchers in the computational sciences, particularly those working in text classification, IR, opinion detection, and NLP, hunger for the elusive “gold standard” in manual annotation. Accurate coding with high levels of inter-rater reliability and validity is possible. For example, work by the eRulemaking Research Group on near-duplicate detection in mass e-mail campaigns demonstrated that focusing on a small number of codes, each with a clear-cut rule set, can produce just such a gold standard.
Reliably coded corpora of sufficient size and containing consistently valid observations are essential to the process of designing and training NLP algorithms. We are likely to see more political scientists using methodologies that combine manual annotation and machine learning. In short, there are exciting possibilities for applied and basic research as techniques and tools emerge for reliably coding across the disciplines. To unleash the potential for this interdisciplinary approach, a research community must now form around the nuts and bolts questions of what and how to annotate, as well as how to train and equip the coders that make this possible.
We did it! The free, open source, Web-based, university-hosted, FISMA-compliant “Coding Analysis Toolkit” (CAT) recorded its one millionth coding choice.
Pretty much all the credit goes to Texifter CTO and chief CAT architect Mark Hoy, who has put in many paid (and unpaid) hours making sure CAT is reliable, usable, and scalable. Texifter Chief Security Officer Jim Lefcakis also played a key role, ensuring the hardware and server room were maintained at the highest level of reliability and security. In honor of this milestone, I have been digging through my unpublished papers looking for material that explains in more detail where CAT, PCAT, DiscoverText, QDAP, and Texifter come from. This post is the first in a series about the particular approach to coding text we have come to call the “QDAP method.”
Large political text data collections are coded and used for basic and applied research in social and computational sciences. Yet the manual annotation of the text—the coding of corpora—is often conducted in an ad hoc, inconsistent, non-replicable, invalid and unreliable manner. Even the best intentions to create the possibility for replication can, in practice, confound the most ardent followers of the creed “Replicate, Replicate.” While mechanical, process, documentary, and other challenges exist for all approaches, practitioners of qualitative or text data analysis routinely profess to greater, even insurmountable, barriers to re-using coded data or repeating significant analyses.
There are diverse approaches to coding text. They tend to be hidden away in small niche sub-fields where knowledge of them is limited to a small research community, a project team, or even a single person. While researchers classify text for a variety of reasons, it remains very difficult, and for many counter-intuitive, to share these annotations with other researchers, or to work on them with partners from other disciplines for whom the coding may serve an alternate purpose. A change in the way researchers think about, conduct, and share coded political corpora is overdue.
Coding is expensive, challenging, and too often idiosyncratic. Training and retaining student coders or producing algorithms capable of tens of thousands of reliable and valid observations requires patience, funding, and a framework for measuring and reporting effort and error. Given these factors, it is not surprising that a proprietary model of data acquisition and coding still dominates the social sciences. Despite the important role for the social in social science, researchers guard “their” privately coded text, even the raw data, fearing others will beat them to the publication finish line or challenge the validity of their inferences. The competitive approach to producing and failing to share annotations disables intriguing and highly scalable collaborative social research possibilities enabled by the Internet.
Researchers should seek to enhance and modernize their architecture for large-scale collaborative research using advanced qualitative methods of data analysis. This will require working out and attaining widespread acceptance of Internet-enabled data sharing protocols, as well as the establishment of free, open source platforms for coding text and for sharing and replicating results. We believe that when utilized in combination, “The Dataverse Network Project” and the “Coding Analysis Toolkit” (CAT) represent two important steps advancing that effort. Large-scale annotation projects conducted on CAT can be archived in the Dataverse and as a result will be more easily available for replication, review, or re-adjudication of their original coding.
In Part Two of the Series “Coding Text the QDAP Way,” we’ll say more about the role of scholarly journals advancing this practice of re-using datasets.