# Common Methods and Programs

## Git

Git is a free, open-source version control system available for download online. This tool can be used for managing a code base and can be extremely helpful when multiple people are working on the same project. It creates isolation for individuals to work on new features and allows them to easily merge those changes together. Learning to use this tool can be beneficial for anyone who needs to organize code.

## R

**Using R to Connect to an SQL Server/MySQL Database**

When working with data stored on an SQL server or MySQL database you can organize and export data from the database to a file that can be imported to R to run statistical analyses. However, rather than exporting the data to do this, you can use R to connect directly to the database. This can be beneficial for researchers who have an ongoing project for which data is being updated frequently. Using this method, researchers can gain direct access to their data without first having to export the data from the database.

**Creating Graphics in R**

If you are interested in learning how to create visually appealing figures and other graphics in R for papers, presentations, or just for fun, then learning how to use the packages ggplot2, lattice (which provides the xyplot function), and rCharts can be extremely useful. All three R packages allow the user to easily customize graphics.

**Package Development**

Creating a package in R allows a researcher or regular R user not only to manage and share functions and datasets, but also to save memory by loading those functions or datasets only when they are needed. Packages can be created and shared privately among colleagues who are working on the same research project and using the same functions for analyses and data management. Packages can also be used on a larger scale as a way to share statistical methodology in R.

**Reproducible Research in R**

Being able to replicate and reproduce research is a core method used to validate research findings. R can be used to facilitate the reproducible research process by using R packages designed for literate programming, which create documents that are combinations of data analysis code, the logic behind the methods, and instructions for reproduction. These tools can be used to create documents that can be submitted with research publications.

**Creating S3 and S4 Class Objects in R**

As with other object-oriented programming languages, classes serve as a template for creating objects in R. Depending on your needs, you may want to create S3 class or S4 class objects. The different classes allow for different levels of template customization. S3 classes are simpler and easier to implement, whereas S4 classes are more structured and complex. Learn which classes best suit the needs of your project.

**Managing Big Data in R**

If you are working with a very large dataset, learning how to use big data management tools in R can be extremely beneficial and time-saving. The data.table package in R allows a user to easily manage a large data set quickly. The plyr package can be used to split large datasets into subsets, apply functions to those subsets, and then combine the results or to quickly find summary statistics for different groups within a large dataset.

**Caret Package for Data Mining**

The caret (classification and regression training) package in R optimizes the process of creating predictive models in R, making it easier to explore large datasets for patterns.

## Python / Pandas

Data analysis can be done using the Python programming language and the Pandas package. The Pandas package can be very useful when working with tabular data, time series data, matrix data, and other forms of observational or statistical data sets.


The International Methods Colloquium (IMC) is a periodic on-line interactive seminar discussion on the application of quantitative statistical methodology to the social sciences. Its goal is to make the discussion of cutting edge techniques and interesting applications accessible to scholars, students, and data scientists all over the world -- free of charge.

Good Enough Practices in Scientific Computing - aimed at users who are new to scientific computing, and also a good refresher on best practices for researchers

Political Science is a social science dedicated to the study of political behavior in its many forms. Methods in political science are both humanistic (e.g., normative political theory) and quantitative (e.g., positive political theory). Research topics include voting and election behavior, regime types, human rights, globalization, government and economy, bureaucratic administration, political crises and scandals, national security and intelligence, public opinion, public policy, political partisanship, campaigning, constitutional law, and more.

**How do quantitative methods enhance political science?**

- Allow researchers to make more specific conclusions and recommendations to policymakers and institutional leaders
- Test formal political theories and intuitive understanding for psychological, economic, and sociological reality
- Offer a tool to provide generalizable and reliable insights into political systems of behavior
- Formally model the relationships between individual decisions and overarching policies

**Overview**

**Types of data**

*Categorical data* – non-numeric data, or counts of cases, that can be divided into groups. For example, political affiliation (Democrat, Republican, independent) would be a categorical variable. Categorical data may be unordered (“nominal” data) or rank ordered (“ordinal” data).

*Continuous data* – a type of numerical data that can take any value within a given range. For example, time until an election can be measured using a continuous measure of time (24 hours, 6 minutes, 23 seconds, 10 milliseconds). Continuous data may have a true zero (“ratio” data) or no true zero (“interval” data), where ‘true zero’ indicates the absence of the quality being measured.

*Discrete data* – a type of numerical data that can only take certain values in a given range. An example of discrete data would be the number of votes for a particular policy proposal. Votes can only be measured in whole number increments; it makes very little sense to say that a particular individual gave 0.2 of a vote on a policy proposal.

*Spatial/dynamic data* – spatial data is information that contains a geo-reference or geo-marker. In other words, spatial data is spread out over physical space, and often over time. This type of data allows researchers to understand the associations of political trends and views with particular times and places.

*Microdata* – survey data about individuals, including their beliefs, attitudes, and behaviors. While surveys exist in all forms of popular media, a good scientific survey will have carefully worded questions, reliability, and validity. These properties are what make microdata suitable for analysis.

*Macrodata* – Also called ‘aggregate data’, macrodata are aggregated data from levels “above” the individual. For example, economic indicators such as inflation, unemployment, GDP, growth, population density, government spending and revenue, interest rates, and foreign aid capture the behavior of individuals, but go further and capture some information about an economic entity. Other forms of political macrodata may include the electoral system, political institutions, regime type, union density, labor market regulations, human/civil rights, political particularism, etc.

**Descriptive Analyses**

The following measures describe the relationships among variables, without the use of inference.

**Measures of central tendency**

*Mean* – the mean represents the numerical average of several observations. The arithmetic mean can be found by summing the observations and then dividing by the number of observations.

*Trimmed Mean* – a mean calculated after omitting the most extreme cases of a distribution. For monthly income data with 5 observations ($50, $1000, $2000, $1500, $15000), a trimmed mean would first eliminate the end points ($50 and $15000) and then average the remaining three values, giving $1500. This technique reduces the impact of outliers on the mean, which makes it useful when a few extreme cases would otherwise distort the average.

*Median* – The median is the center value of an ordered distribution, rather than an arithmetic average. This measure of central tendency can be used to describe the typical value of ordinal data as well as continuous data (ratio & interval data).

*Variation Ratio* – This measure is based on the mode, but rather than simply telling the scholar what the most frequently occurring category/value is, the variation ratio expresses the proportion of cases that *do not* fit into the modal category. The variation ratio is a way to examine how central a tendency the mode really is.

*Mode* – A measure used to describe the central tendencies of categorical (nominal & ordinal) data. The mode refers to the category that occurs most frequently in a dataset. For example, if a researcher is interested in which region (Europe, North America, or Asia) contains the most high-GNP countries, she may count the countries with a GNP over 20,000 and note which region each falls into; the region that occurs most often is the mode.
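The trimmed mean and the variation ratio are the least familiar of these measures, so a minimal Python sketch may help. It uses the income figures from the trimmed mean example; the function names `trimmed_mean` and `variation_ratio` and the party-affiliation list are assumptions made up for illustration, not part of any particular library or dataset.

```python
from collections import Counter
from statistics import mean, median


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k would remove every observation")
    return mean(ordered[k:len(ordered) - k])


def variation_ratio(categories):
    """Proportion of cases that fall outside the modal category."""
    counts = Counter(categories)
    modal_count = max(counts.values())
    return 1 - modal_count / len(categories)


# Monthly income example from the text
incomes = [50, 1000, 2000, 1500, 15000]
plain_mean = mean(incomes)        # pulled upward by the 15000 outlier
trimmed = trimmed_mean(incomes)   # averages 1000, 1500, and 2000
middle = median(incomes)

# Hypothetical party affiliations, invented for illustration
parties = ["democrat"] * 5 + ["republican"] * 3 + ["independent"] * 2
vr = variation_ratio(parties)     # share of cases outside the mode

print(plain_mean, trimmed, middle, vr)
```

With these numbers the untrimmed mean is far above the trimmed mean and the median, which agree with each other. That gap is the kind of outlier effect the trimmed mean is meant to remove.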

**Measures of dispersion**

*Range* – The minimum value of a data set subtracted from the maximum value. This reflects the overall amount of variation in a data set. Useful only for numerical data.

*Quantile range* – this is a measure of how tightly or loosely clustered the data are around the center of a distribution; the interquartile range, for example, is the spread of the middle 50% of cases. This is particularly helpful for determining whether the median is representative of the “average” in a dataset.

*Variance* – The mean of squared deviations between each score and the mean. The greater the variance, the more dispersion around the mean value.

*Standard Deviation* – the square root of the variance, which expresses variation around the arithmetic mean in the original units of measurement. Larger standard deviations indicate more variation around the mean, while smaller standard deviations indicate less variation around the mean. If there is wide variation (i.e., a high standard deviation), the arithmetic mean may not truly reflect the “average” of a dataset.

*Index of Diversity* – A measure of variability for categorical data, the ID (Index of Diversity) represents the likelihood that any two observations will fall into different categories. If category A has 100 observations and category B has 0, then ID is 0 because there is no chance that any two observations will fall into different categories.

*Index of Qualitative Variation* – A variant of the ID that corrects for the number of possible categories. Like ID, the Index of Qualitative Variation (IQV) is interpreted in terms of how likely any two observations are to fall into separate categories.
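As a rough illustration of the two categorical measures, the sketch below uses one common formulation: ID as one minus the sum of squared category proportions, and IQV as ID divided by its maximum possible value for K categories, (K - 1) / K. These formulas and the function names are assumptions chosen for the example; textbooks differ slightly in how they define both indices.

```python
from collections import Counter


def index_of_diversity(categories):
    """1 minus the sum of squared category proportions."""
    counts = Counter(categories)
    n = len(categories)
    return 1 - sum((count / n) ** 2 for count in counts.values())


def index_of_qualitative_variation(categories, n_possible):
    """ID rescaled by its maximum for n_possible categories, (K - 1) / K."""
    if n_possible < 2:
        raise ValueError("at least two possible categories are required")
    return index_of_diversity(categories) * n_possible / (n_possible - 1)


# The example from the text: every observation falls in category A
all_a = ["A"] * 100
# Hypothetical contrast case: observations split evenly between A and B
even_split = ["A"] * 50 + ["B"] * 50

id_all_a = index_of_diversity(all_a)
id_even = index_of_diversity(even_split)
iqv_even = index_of_qualitative_variation(even_split, n_possible=2)

print(id_all_a, id_even, iqv_even)
```

Under this formulation an even two-way split gives an ID of 0.5 but an IQV of 1, which shows what the correction does: it rescales ID so that maximum diversity always equals 1, however many categories are possible.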

**Measures of Association Among Variables**

*Social Network Analysis (SNA)* – Maps entities or agents (“nodes”) through their common connections (“links”). For example, environmental activist groups may be connected by common membership, missions, or donors. Networks can have different structures and link types, depending on what the researcher wishes to explore.

*Content Analysis* – content analysis is the systematic counting, assessment, and interpretation of the form and substance of communication. Content analysis is often applied to texts to answer various questions. For example, quantifying the content of a presidential candidate’s speeches gives insight into which other political figures he or she most resembles.

*Lambda* – a measure of association between two nominal variables. Lambda’s values can vary between 0 and 1, where 0 indicates no relationship and 1 indicates a perfect association. For example, if there were a relationship between government type (dictatorship, republic, democracy) and civil rights, we would expect a large value of lambda.

*Kruskal’s gamma* – Also known as Goodman and Kruskal’s gamma, this measure of association is for ordinal variables, or those that have rank but no fixed units between levels. It addresses the question: can we predict the ranking of cases for ordinal variable A if we know the ranking for variable B? Predictability ranges from -1 to +1, with +1 indicating ‘perfect agreement’ (as A increases, B increases) and -1 indicating a ‘perfect inversion’ (as A increases, B decreases). A gamma of 0 means there is no benefit to predicting B using A. For example, if knowing a voter’s level of approval of gun ownership is predictive of his or her stance on universal health care, then gamma should differ from zero.

*Odds Ratio* – assesses the relationship between two dichotomous variables (categorical variables with only 2 levels). An odds ratio of 1.5, for example, can be read as “the odds of the outcome in group X are 1.5 times the odds in group Y.” Unlike most measures of relationship strength, 1 is an indication of no relationship, not 0.

*Pearson’s Product-Moment correlation* – When data are on a ratio or interval scale, it is appropriate to describe the variables’ level of association using a correlation analysis. If a researcher is curious about the association between national debt and spending on government programs, these ratio data would be appropriate to enter into a correlation analysis.
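Gamma and the odds ratio are both simple enough to compute by hand, which the following Python sketch does. Gamma is computed from concordant and discordant pairs, ignoring ties, and the odds ratio from the four cells of a 2x2 table. The function names and all of the data are hypothetical and exist only to illustrate the arithmetic.

```python
from itertools import combinations


def gamma(pairs):
    """Gamma from a list of (a, b) ordinal scores, one pair per case."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:
            concordant += 1      # both variables order the two cases alike
        elif product < 0:
            discordant += 1      # the variables order the two cases oppositely
        # pairs tied on either variable are ignored
    if concordant + discordant == 0:
        raise ValueError("every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]; b and c must be non-zero."""
    return (a * d) / (b * c)


# Hypothetical 1-3 ratings on two issues for three voters
agreement = gamma([(1, 1), (2, 2), (3, 3)])
inversion = gamma([(1, 3), (2, 2), (3, 1)])
mixed = gamma([(1, 2), (2, 1), (3, 3)])

# Hypothetical table: rows are two groups, columns are yes/no on a policy
ratio = odds_ratio(30, 10, 20, 20)

print(agreement, inversion, mixed, ratio)
```

The first two calls reproduce the ‘perfect agreement’ and ‘perfect inversion’ endpoints, and the third falls in between. An odds ratio of 1 would indicate no relationship between group and response.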

**Inferential Analyses**

*Confidence Intervals* – A confidence interval is a range of values that is likely to include the real population value. Confidence intervals can be used to test hypotheses (e.g., true average household income is $30K per annum) by estimating an interval. If the interval contains the hypothesized value (e.g., $30K), then the hypothesized value cannot be dismissed out of hand. If the interval excludes the hypothesized value, then the hypothesized value is unlikely to be the true population value.

*Chi-square test* – Appropriate for testing the statistical significance of the association between two nominal variables. A Chi-square test compares the observed frequency of a variable to the expected frequency of a variable. If a researcher wanted to know if there was an association between gender (male, female, other) and partisanship (democrat, republican, other), the researcher would compare the observed number of people in each gender-by-party combination with the number expected if there were no relationship between the two variables.

*Z-test* – a test based on the ‘normal’ (or Gaussian) distribution. It is appropriate for tests of a single population or when the standard deviation of the population is known (or can be estimated from a sample size of 30 or more). A z-test can be used on gamma, which is a measure of how predictive one ordinal variable is of another.

*T-test* – a t-test assesses whether the means of two populations are different, using the t distribution rather than the normal distribution. It is best when sample sizes are small (less than 30), or when the population standard deviation is not known and must be estimated from the sample. A t-test would be appropriate if a researcher were looking for a significant difference between two small focus groups on some numerical measure.

*Ordinary least squares regression* – In a **bivariate** case, OLS estimates a linear relationship between one outcome measure (dependent variable) and one indicator (independent variable). For example, the relationship between environmental conflict and water scarcity could be analyzed with an OLS regression model. It is possible to extend this model to a **multivariate** case, where several indicators may be used. OLS regression can accommodate indicator variables of differing types (i.e., continuous AND categorical variables), which makes it applicable to a wide variety of questions.

*Analysis of Variance (ANOVA)* – A special case of the general linear model (regression). The independent variables are categorical and the dependent measures are continuous. For example, an ANOVA would be the appropriate analysis to see if political donations in dollars (continuous variable) varied by county (categorical variable).
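To make the comparison of observed and expected frequencies concrete, here is a minimal Python sketch of the chi-square statistic for a contingency table. Expected counts are derived from the row and column totals under the assumption of no relationship. The function name `chi_square_statistic` and the 2x2 table of counts are hypothetical; a real analysis would also convert the statistic to a p-value using a statistics package.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two groups of voters, columns are two parties
observed = [[30, 10],
            [10, 30]]
stat, dof = chi_square_statistic(observed)

# A table with no association at all, for contrast
flat_stat, _ = chi_square_statistic([[20, 20], [20, 20]])

print(stat, dof, flat_stat)
```

For one degree of freedom, the conventional 0.05 critical value is about 3.84, so a statistic as large as the one in the first table would count as a significant association. The second table produces a statistic of zero.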

**In the News**

An article exploring the linguistic properties of Donald J. Trump’s speeches: http://motherboard.vice.com/read/trump-linguistics

Historical patterns of marginal taxation rates across income levels: http://www.theatlantic.com/business/archive/2016/04/taxes-rich/478338/

How to qualitatively and quantitatively assess the political tendencies of the judiciary: https://www.washingtonpost.com/news/monkey-cage/wp/2016/03/30/new-data-show-how-liberal-merrick-garland-really-is/

**Resources & Software implementation**

STATA - http://www.ats.ucla.edu/stat/stata/

GIS - http://www.arcgis.com/features/features.html - used for description, presentation, and analysis of spatial data


**References**

Böhmelt, T. (2011). *Overview of the possibilities of quantitative methods in political science* [PowerPoint slides]. Retrieved from http://www.slideshare.net/environmentalconflicts/overview-of-the-possibilities-of-quantitative-methods-in-political-science

King, G. (1986). How not to lie with statistics: avoiding common mistakes in quantitative political science. *American Journal of Political Science, 30*, 666-687.

Manheim, J.B. (2008). *Empirical political analysis*. New York, NY: Pearson/Longman.

Johnson, J.B., Reynolds, H.T., & Mycoff, J.D. (2008). *Political science research methods*. Washington, DC: CQ Press.

Signorino, C.S. (1999) Strategic Interaction and the Statistical Analysis of International Conflict. *The American Political Science Review, 93*(2), 279-297.

Linguistics is the study of language, and quantitative methods allow linguists to understand language as discrete units for combination. While language may seem like a field that favors qualitative and descriptive approaches to research, it is a topic filled with information in want of quantitative analysis. The sounds of language (phonetics), the mental representation of meaningful words (semantics, lexical semantics), how linguistic units are combined to form meaning (syntax), how languages change over time (historical linguistics), and the way that languages are used by individuals (sociolinguistics), can all be explored using a variety of quantitative approaches.

**How do Quantitative Methods enhance the study of language?**

*Quantitative approaches can inform all levels of linguistic research. First, quantitative methods allow for the concise summary of any amount or type of data. Observations can be condensed and general trends revealed using measures of central tendency and dispersion. Such methods can also uncover the types of relationships among variables. When differences or relationships between two or more factors are observed, the logical question is, are those differences real or could they have arisen by chance? Quantitative methods enable these kinds of inferences. The following are examples of topics that quantitative methods could address:*

- How young children learn to use their native language or native languages
- Whether sound variants in a dialect of a language have become more mainstream, and at what rate the sound system is changing
- The flexibility or rigidness inherent to classes of morphemes (the smallest meaningful units of a language, which may or may not be complete words)
- The ways in which different socio-cultural groups use spoken language to express identity
- The relationship between gesture and spoken language
- Whether two or more languages are related, or whether language use is changing (i.e., historical linguistics)
- How language users comprehend ‘chunks’ of language, such as compounds, phrases, and sentences when those chunks are complete or incomplete
- The underlying expectations that listeners have about their conversation partners (e.g. A young boy saying, “Today I flew a plane” is much more unexpected than a young boy saying, “Today I flew a kite.”)
- Extrapolation of socio-historical themes, story architecture, and discourse structure from a corpus of written stories or interview transcripts.

## Overview

**Types of Data**

*Nominal data* – count data or data that can be divided into groups. Morpheme type (“free” vs “bound”) and language family (Germanic, Slavonic, Asian, etc.) are examples of nominal variables, as they can be divided into groups.

*Ordinal data* – Data that can be grouped *and* rank ordered. For example, if a researcher is interested in the order of acquisition for certain words, like “mama,” “dada,” and “me”, these items could be rank ordered by time of acquisition and studied as frequencies (e.g., how often “dada” was the first word vs. “mama”).

*Interval data* – Interval data is measured on a numerical scale that does not have a true zero (i.e., where ‘true zero’ means an absence of the quality). An individual’s familiarity with a word on a 1 to 7 scale (e.g., 1 = unknown; 7 = familiar and meaning known) would be an interval measure. It is a numeric representation of familiarity, but there is no true zero in the scale.

*Ratio data* – A type of numerical data that has both a true zero and equal spacing between units. In a study of prosody and stress-timing, the duration (in milliseconds) of a syllable is an example of ratio data. Time is a ratio scale because 0 indicates an absence of duration and the interval between milliseconds is non-arbitrary: a duration of 0 ms means an absence of duration, and 40 ms is precisely twice as long as 20 ms.

## Descriptive Analyses

**Measures of central tendency**

*Mode* – A measure used to describe the central tendencies of categorical (nominal & ordinal) data. The mode refers to the category that occurs most frequently in a dataset. For example, if a researcher is interested in whether males or females are more flexible in what they consider a well-formed sentence, the researcher may ask equal numbers of females and males to judge a sentence as grammatical or not (yes/no). The most frequent answer, yes or no, sorted by gender, would be an example of the mode.

*Median* – The median is the center value of an ordered list. Imagine a list of 5 durations in order from shortest to longest: 10 ms, 12 ms, 14 ms, 20 ms, 45 ms. The median would be 14, as it sits in the middle of the ordered list. This measure of central tendency can be used to describe the typical value of ordinal data as well as continuous data (ratio & interval data).

*Mean* – the mean represents the arithmetic average of a sample of observations. The arithmetic mean can be found by summing the observations and then dividing by the number of observations. While many inferential tests are based on means, the mean is subject to the pull of extreme values. Referring back to our list of durations from above, the mean of this data set would be 20.2, well above the median, because the extreme value (45) exerts a stronger influence than the more typical values (10, 12, 14, 20).

*Weighted mean* – like an arithmetic mean, but it can give greater strength to certain observations over others. For example, if one collects certainty ratings on a 1 to 7 scale, the average can be weighted so that more certain answers affect the mean more than uncertain answers.
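A short Python sketch can show how these measures behave on the duration list, and how a weighted mean might be computed. The `weighted_mean` helper, the yes/no responses, and the certainty ratings are all hypothetical and serve only to illustrate the idea of weighting by certainty.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Average in which each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Syllable durations from the median example, in milliseconds
durations_ms = [10, 12, 14, 20, 45]
center = median(durations_ms)    # middle value of the ordered list
average = mean(durations_ms)     # pulled above the median by the 45 ms value

# Hypothetical grammaticality judgments (1 = yes, 0 = no), each paired
# with the speaker's certainty rating on a 1 to 7 scale
responses = [1, 0, 1, 1]
certainty = [7, 1, 6, 2]
unweighted = mean(responses)
weighted = weighted_mean(responses, certainty)

print(center, average, unweighted, weighted)
```

In this invented data the single "no" response came with the lowest certainty, so the weighted mean sits closer to 1 than the unweighted mean does.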

**Measures of dispersion**

*Range* – The range is the difference between the maximum and minimum values in a data set. For example, if we have a set of data on vocal pitch (measured in Hertz, or Hz) where the highest and lowest values are 400 Hz and 399 Hz, the range of the data is 1 Hz.

*Mean absolute deviation* – The sum of the absolute differences between each value in the data set and the mean, divided by the total number of observations in the data set. This measure is not often used, but it provides an improvement over the range because it is based on all measures in the data set rather than just two (min and max).

*Variance* – Like mean absolute deviation, but the deviations from the mean are squared (rather than taken as absolute values) before summing and dividing by the number of observations in the dataset.

*Root mean square (RMS)* – The root mean square deviation is simply the square root of the variance. It is so called because it is the *square root* of the average squared deviation from the *mean*. It is more commonly known as the ‘standard deviation’.
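The four measures build on one another, which a minimal Python sketch makes visible. The pitch values are hypothetical. Following the definitions above, the variance here divides by the number of observations; many statistics packages instead divide by one less than that when working with a sample.

```python
from math import sqrt
from statistics import mean


def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def variance(values):
    """Average squared distance of each value from the mean (divides by N)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Hypothetical vocal pitch measurements in Hz
pitches_hz = [180, 200, 220, 240]

value_range = max(pitches_hz) - min(pitches_hz)
mad = mean_absolute_deviation(pitches_hz)
var = variance(pitches_hz)
rms = sqrt(var)    # root mean square deviation, i.e. the standard deviation

print(value_range, mad, var, rms)
```

Because squaring gives extra weight to large deviations, the RMS value for a data set is never smaller than its mean absolute deviation.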

## Hypothesis Testing Methods

*z-test* – A z-test can be used to assess whether a normally distributed population differs from a hypothesized value, or whether two such populations differ from each other, when the sample size is large (greater than 30) and the population standard deviation is known or can be estimated. This type of test converts values into z-scores, which express the deviation between a given value and the mean in units of standard deviation.

*t-test* – A t-test assesses whether the means of two populations are different, using the t distribution rather than the normal distribution. It is best when sample sizes are small (less than 30), or when the population standard deviation is unknown and must be estimated from the sample. For example, if a researcher wants to know whether two vowel sounds are merging, the researcher might compare the vowel productions of young and old speakers of a language. Because the population standard deviation of those productions is unlikely to be known, a t-test would be appropriate.

*Ordinary Least Squares Regression* – A regression models a linear relationship between a dependent and an independent variable. A regression equation fits a straight line to a set of data points such that the points fall as close to the line as possible, in the sense that the sum of squared vertical distances from the line is minimized. When there is only one independent variable, this method is simply known as ‘linear regression.’ When there are two or more independent variables, this method is called ‘multiple regression’.

*Analysis of Variance* – More commonly known as ANOVA, this analytic technique is a special case of the general linear model (i.e., a special case of regression). The independent variables are categorical, typically with more than two levels, and the dependent measure must be continuous. For example, if researchers are interested in whether negative attitudes toward Pit Bull terriers (a breed of dog sometimes portrayed as uncommonly violent) differ according to what a person believes Pit Bulls represent (e.g., violent crime, drugs), an ANOVA would be appropriate if the negative attitudes were measured on a continuous scale.

*Chi-square test* – This type of test is appropriate for testing the statistical significance of the association between two categorical variables. A Chi-square test compares the observed frequency of a variable to the expected frequency of a variable. For example, if network size were unrelated to gender, we might expect the number of males and females with large and small social networks to be about the same (i.e., in a sample of 12 males and 12 females, 6 males and 6 females might have large networks and 6 males and 6 females might have small social networks). If a sociolinguist wants to know if the observed network sizes (small vs large) are significantly different from chance for males and females, a Chi-square test would be appropriate.

*Linear Mixed Effects Regression* – Like a standard regression model, a linear mixed effects (LME) model attempts to predict a dependent measure from one or more independent measures. However, LME has an advantage over standard regression because it permits both *fixed effects* (independent variables) and *random effects* (sources of variation or noise in the data, such as differences among individual speakers or words) in the analysis. Because the model uses a more nuanced specification of the data, it can represent grouped or repeated observations more faithfully. These models are complex, but for such data they often yield better results than ANOVAs and standard regression.
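The line-fitting idea behind ordinary least squares can be sketched in a few lines of Python for the single-predictor case. The closed-form slope and intercept below are the standard least-squares solution. The `fit_line` name and the exposure and reaction-time data are hypothetical, and a real analysis would use a statistics package that also reports standard errors and significance tests.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the independent variable has no variation")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: number of exposures to a word, and the reaction
# time (in ms) to that word
exposures = [1, 2, 3, 4, 5]
reaction_ms = [700, 660, 640, 590, 560]

slope, intercept = fit_line(exposures, reaction_ms)
predicted_at_6 = intercept + slope * 6   # extrapolation from the fitted line

print(slope, intercept, predicted_at_6)
```

With this invented data the slope is negative, meaning each additional exposure is associated with a faster predicted response.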

## Pattern Discovery Methods

*Pearson’s r Correlation* – Pearson’s *r* represents a simple linear relationship between two variables. *r* always ranges from -1 to +1, with 0 indicating no linear relationship. If *r* is positive, it means that two variables go up and down together. For example, if the number of exposures to a word is positively correlated with familiarity ratings of that word, it would mean that as exposures increase, so too does the familiarity rating. However, if the number of exposures to a word is negatively correlated with reaction time to that word, that would indicate that as the number of exposures increases, response time decreases (gets faster).

*Principal Components Analysis* – In linguistic and other data, the measurements we use are often imperfect. The question behind Principal Components Analysis (PCA) is this: how many dimensions are there to the data? For example, a language development researcher might want to understand how phonemic awareness predicts reading ability. However, there may be four different scales of reading ability available, and, not knowing which is the best, the researcher uses all four. PCA uses correlations among measures to arrive at a reduced set of dimensions (‘principal components’) in the data. In our language development example, if the reading ability scales measure the same thing, they may all be part of the same dimension.

*Cluster Analysis* – Most commonly used in historical linguistics, this tool is also used in population biology and epidemiology. Cluster analysis attempts to group similar items with one another. For example, if one wants to trace the lineage of European languages, one way would be to cluster them based on similarity (e.g., number of shared cognates) under the assumption that languages that are more similar must have split more recently and dissimilar languages must have split more distantly.

*Cladistics* – Another technique that historical linguistics shares with population biology, cladistics is the practice of maximizing the simplicity and the compatibility of various tree models to arrive at an understanding of how languages are related. Cladistics begins with a set of “characters” (i.e., features) associated with each language. Tree structures are then generated to nest the similarity of linguistic features with each other. By this method, family trees of languages can be devised. It is important to choose characters well, as the tree is only as good as the characters chosen.

*Multi-dimensional scaling* – Multidimensional scaling (MDS) is used to plot the similarity among items in a dataset. The advantage of MDS over cladistics or cluster analysis is that MDS does not assume a hierarchical structure. For example, German and Dutch could be highly similar without being forced into a hierarchy as siblings.

*Quantitative Narrative Analysis* – Abbreviated as QNA, this technique is primarily employed by discourse analysts to extract structure and information from corpora (large bodies of narrative or other written work). QNA explores lexical items, syntax, and the relationships within and between narratives.
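The correlation example above (word exposures against familiarity ratings and against reaction times) can be sketched directly in Python. The function computes Pearson’s r from its definition, the co-variation of the two variables divided by the product of their separate variations. The function name and all of the data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of cases")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical data for five words
exposures = [1, 2, 3, 4, 5]
familiarity = [2, 3, 5, 6, 7]               # ratings on a 1 to 7 scale
reaction_ms = [700, 660, 640, 590, 560]     # reaction times in ms

r_familiarity = pearson_r(exposures, familiarity)   # positive: rise together
r_reaction = pearson_r(exposures, reaction_ms)      # negative: one falls

print(r_familiarity, r_reaction)
```

Both invented relationships are nearly linear, so the two coefficients land close to +1 and -1 respectively. Real linguistic data would rarely be this tidy.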

## Resources & Tools

R for Statistical Computing – This link will lead you to the latest download of R. This software is free and customizable, and provides powerful functionality. For these reasons, it is quickly becoming the standard for quantitative linguistic analysis. Many of the analysis techniques described above come with the default R packages, but others can be downloaded to supplement its quantitative capabilities.

R Tutorial: MDS – A helpful walkthrough of how to perform multidimensional scaling with the default packages in R.

PHYLIP package – R instantiation of the PHYLIP software, which is much older. This package contains powerful tools for tracing the history of languages, including phylogeny inference methods, tree drawing methods, and ways to select the best tree structure.

APE package documentation – R package used by biologists, but also used by linguists to model hierarchical relationships among language data.

lme4 package documentation – R package for fitting mixed effects models to linguistic data.

## References

Baayen, R.H., Davidson, D.J., & Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. *Journal of Memory and Language, 59*, 390-412.

Franzosi, R. (2010). *Quantitative Applications in the Social Sciences: Quantitative narrative analysis*. SAGE Publications Ltd, doi: 10.4135/9781412993883

Johnson, K. (2008). *Quantitative methods in linguistics*. Malden, MA: Blackwell Publishing.

Tėšitelová, M. (1992). *Linguistic and Literary Study in Eastern Europe: Quantitative Linguistics*. Amsterdam, Holland: John Benjamins Publishing Company.

The primary focus of cultural and ethnic studies is to understand the social, philosophical, literary, theological, and historical components of culture as a unified whole, as a survival mechanism, and as an individual’s performance. The field is typically thought of as humanistic, or as one that emphasizes qualitative over quantitative analyses, but quantitative methods can supplement experiential analyses. The use of quantitative methods is still somewhat nascent in the field of cultural and ethnic studies.

**How do Quantitative Methods enhance the study of Culture and Ethnicity?**

*Some individuals view quantitative methods with suspicion, or view the scientific method as culturally biased, and therefore inappropriate for cultural or ethnic studies. Indeed, quantitative methods are a ‘garbage in, garbage out’ system: if the research that produced the data stands on shaky ground, then the results cannot be supported. However, when used appropriately, quantitative methods offer another form of evidence that can be marshaled in support of experiential observations. Quantitative methods can be used to pursue the following questions already of interest to scholars of culture and ethnicity:*

- Understand how individuals define culture, and whose definitions of culture have the greatest weight
- Evaluate, compare, and predict what cultural changes are likely to emerge in a given context
- Assess the topics of narratives in a socio-historical context
- Explore how media consumption and self-identity relate to cultural context
- Examine the spread of a cultural construct or artifact over time and space

## Overview

**Types of data**

*Categorical data* – non-numeric or count data that can be divided into groups. For example, ethnic identity (Irish, Jewish, Scottish, Welsh, British, etc.) would be a categorical variable. Categorical data may be unordered (“nominal” data) or rank ordered (“ordinal” data).

*Textual data* – Narratives collected from interviews, stories or novels written in a particular time period, historical documents, newspaper articles, autobiographies, and any other written cultural artifact can be counted as data. Textual data is a subset of categorical data, but is important enough to the study of culture and ethnicity that it warrants a specific mention.

*Continuous data* – a type of numerical data that can take any value within a given range. Time spent engaged in work activities and income level are examples of continuous data. Continuous data may have a true zero (“ratio” data) or no true zero (“interval” data), where a ‘true zero’ indicates the complete absence of the quantity being measured.

*Discrete data* – a type of numerical data that can only take certain values in a given range. The number of Black men represented in British dramas is an example of discrete data. It makes little sense to measure people in fractional increments, as 0.2 of an individual has no real-world significance.

## Descriptive Analysis

**Text analysis**

*Content Analysis* – Cultural and ethnic studies rely heavily on textual or visual artefacts. Content analysis is a way to understand what is written or depicted by counting phenomena in text (e.g., the number of stories, the number of images, the number of unique nouns, etc.). For example, a researcher could use content analysis to assess the language surrounding sports coverage in various media outlets.

*Topic Modeling* – a text analysis method that, in an unsupervised way, discovers the semantic structures that underlie content. Topic models assume that within each ‘document’ there are many topics, and that each word has a probability of being sorted into a topic. A scholar might use this technique to uncover the implicit content categories in a narrative or interview as a window into the cultural milieu.

*Sentiment Analysis* – a type of text analysis that is primarily concerned with uncovering the subjective opinions, emotions, and attitudes contained within a text. This can be done with either simple (“positive” vs. “negative”) polarity analysis, or with a more complex categorization structure.
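
As a rough illustration of content analysis and simple polarity scoring, the sketch below counts terms of interest in a handful of headlines and scores each headline as positive or negative. The headlines and both word lists are invented for the example; a real study would use a sampled corpus and a validated sentiment lexicon.

```python
import re
from collections import Counter

# Invented headlines standing in for a real sample of sports coverage.
HEADLINES = [
    "Local team wins in a thrilling final",
    "Star striker injured as team loses final",
    "Fans celebrate as team wins the cup",
]

# Hypothetical mini-lexicon for polarity scoring (an assumption, not a
# published resource).
POSITIVE = {"wins", "thrilling", "celebrate"}
NEGATIVE = {"loses", "injured"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_terms(documents, terms):
    """Content analysis: count occurrences of the chosen terms."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t in terms)
    return counts

def polarity(text):
    """Positive-word count minus negative-word count for one document."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

term_counts = count_terms(HEADLINES, {"team", "wins", "loses", "injured"})
print(term_counts.most_common())
for headline in HEADLINES:
    print(polarity(headline), headline)
```

Counting by lexicon ignores negation and irony, so results like these are best treated as a first pass that supports, rather than replaces, close reading.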

**Other Descriptive Measures**

*Mean* – the mean represents the numerical average of several observations. The arithmetic mean can be found by summing the observations and then dividing by the number of observations.

*Median* – The median is the middle value of an ordered distribution, rather than an arithmetic average. This measure of central tendency can be used to summarize ordinal data as well as continuous data (ratio & interval data).

*Mode* – A measure used to describe the central tendencies of categorical (nominal & ordinal) data. The mode refers to the category that occurs most frequently in a dataset. For example, if a researcher is interested in which ethnicity (Asian, Black, Hispanic) has the most positive view of police, they might count the number of individuals who express a “favorable” view and sort those individuals by ethnicity; the ethnicity with the largest count is the mode.
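
The three measures above can be computed with Python's standard `statistics` module. The numbers and labels below are made up purely to show which measure suits which type of data.

```python
import statistics

# Hypothetical observations, invented for illustration.
hours_worked = [35, 40, 40, 42, 60]        # continuous (ratio) data
agreement_ratings = [1, 2, 2, 3, 5]        # ordinal data (1 = low, 5 = high)
favorable_by_ethnicity = [                 # categorical (nominal) data
    "Asian", "Black", "Hispanic", "Black", "Asian", "Black",
]

mean_hours = statistics.mean(hours_worked)            # sum / count
median_rating = statistics.median(agreement_ratings)  # middle ordered value
modal_group = statistics.mode(favorable_by_ethnicity) # most frequent category

print(mean_hours, median_rating, modal_group)
```

Note how the single large value (60) pulls the mean of `hours_worked` above its median of 40, which is one reason the median is often preferred for skewed data.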

## Inferential Analysis

*Chi-square test* – Appropriate for testing the statistical significance of the association between two categorical variables. A chi-square test compares the observed frequencies in each category with the frequencies that would be expected if there were no association. If a researcher wanted to know whether there was an association between a child’s gender (male, female) and the brand names of toys (e.g., Teenage Mutant Ninja Turtles, Barbie, and Legos) requested in letters to Santa Claus, the researcher would compare the actual numbers of requests by gender with the expected values (e.g., an even 33-33-33 split, which would be “chance level” for three equally popular options) to uncover a relationship.

*t-test* – A t-test may be appropriate for assessing whether the means of two groups differ. It assumes roughly normally distributed data and refers the test statistic to the t-distribution rather than the normal distribution, which makes it well suited to small samples (fewer than 30) or to cases where the population standard deviation is not known. If a researcher wanted to know whether male and female Asian American college students differ in the amount of cultural values conflict they feel, a t-test may be appropriate; comparing more than two groups (e.g., male, female, other) calls for an ANOVA instead.

*Analysis of Variance* – more commonly known as ANOVA, this analytic technique is a special case of the general linear model (regression). The independent variables are categorical and the dependent measures are continuous. For example, if researchers are interested in whether negative attitudes toward Pit Bull terriers (a breed of dog sometimes portrayed as uncommonly violent) are associated with a person’s attitudes toward what Pit Bulls represent (e.g., violent crime, drugs), an ANOVA would be appropriate if negative attitudes toward Pit Bulls were measured on a continuous scale and attitudes toward what they represent were recorded as categories.
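
To make the chi-square comparison of observed and expected frequencies concrete, here is a minimal sketch in plain Python. The counts are hypothetical, and the expected values are derived from the row and column totals, which is the usual approach for a test of association. In practice a statistics package would also report the p-value.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for a table
    of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count for a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Hypothetical toy requests: rows are two genders, columns are three brands.
observed = [
    [30, 10, 20],
    [10, 30, 20],
]
stat, dof, expected = chi_square(observed)
print(stat, dof, expected)
```

For these invented counts the statistic is 20 with 2 degrees of freedom, well above the 5.99 critical value at the .05 level, so the null hypothesis of no association would be rejected.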

## Resources & Tools

http://sentiment.vivekn.com/ Free, online sentiment analyzer tool (positive/negative polarity)

http://tapor.ca/home Tools for text analysis (including sentiment analysis, social media analysis, and Wikipedia scraping tools)

http://www.ats.ucla.edu/stat/r/ - useful information on how to use the stats program R to carry out various quantitative functions

## References

Cohen, J. & Richardson, J. (2002). Pit Bull panic. *The Journal of Popular Culture, 36*(2), 285-317.

Otnes, C., Kim, Y.C., & Kim, K. (1994). All I want for Christmas: an analysis of children’s brand requests to Santa Claus. *The Journal of Popular Culture, 27*(4), 183-194. http://www.uni-klu.ac.at/mim/downloads/otnes_1994.pdf

Pickering, M. (Ed.). (2008). *Research Methods for Cultural Studies.* Edinburgh, UK: Edinburgh University Press. https://lcst3789.files.wordpress.com/2012/01/pickering_ed__research_methods_in_cultural_studies.pdf

Rahman, Z. & Witenstein, M.A. (2012). A quantitative study of cultural conflict and gender differences in South Asian American college students. *Ethnic and Racial Studies, 37*(6), 1121-1137. http://dx.doi.org.proxy.library.emory.edu/10.1080/01419870.2012.753152

Stokes, J. (2003). *How to do Media and Cultural Studies*. London, UK: Sage. http://site.ebrary.com/lib/emoryac/reader.action?docID=10131993

The focus of much quantitative work in the field of English and Literary Studies is text and corpus analysis. Such analyses offer a variety of advantages to the inquisitive literary scholar.

**How do quantitative methods enhance the study of literature?**

*While quantitative methods ought not supplant traditional methods of critical reading and qualitative analysis, quantitative methods provide another set of tools with which to understand a text.*

- Significantly faster (and in some cases, more accurate) than traditional qualitative analysis techniques of literature, but based on the same fundamentals of traditional qualitative techniques
- Reveal true differences in trends, patterns, etc. in the text that are not due to chance or random variation alone
- Determine authorship, a practice known as forensic (or literary) authorship attribution
- Study the relationship among manuscripts, including evolution of a work, author, character, vocabulary, etc. over time
- Assess the reality of literary divisions, concepts, and time periods
- Nominate key words, phrases, structures, etc., that contribute to classification of a particular genre
- Quantify the writing style of an author, time period, or genre

**Overview**

**Defining ‘the text’**

The ‘text’ can refer to multiple constructs. It can be a single essay, a poem, or a book chapter; it can refer to the sum total of works by a particular author, or the works by authors from a specific genre of literature. It can even be a body of work (i.e., a corpus) from a historical time period. Nearly any collection of words can be quantified and analyzed as a ‘text’ or a corpus. For ease of computation, quantitative analyses of literary texts are typically performed on digital copies (although see Yule, 1944, for computational analysis performed without the benefit of computers).

**Preparing Text for Analysis**

*Frequency and count data* – The most straightforward way to quantitatively analyze text is to count various features, such as semantic topics, nouns, narrators, characters, plot elements, or punctuation. Nearly anything can be counted and assessed, though the researcher must give careful thought to the question he or she seeks to answer by quantifying the text.

*Averages and means* – The average number of declarative sentences in a story might be diagnostic of a particular genre or writing style. As is true of any quantitative analysis, the data are only as good as the research question.

*Ratios* – The frequency of one feature relative to another. For example, the ratio of metaphor usage to figurative language in general may be more diagnostic of a particular author than simply how many times metaphors are used.

*Z-score normalization* – This type of transformation “centers” the data around the arithmetic mean (average) and rescales it in units of standard deviation. The advantage of z-score normalization is that it permits comparison across multiple data types that may initially have scales too divergent to compare. Z-score normalized data can be compared to any other data that has been z-score normalized.
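
The sketch below illustrates ratios and z-score normalization on invented per-chapter counts. It assumes the population standard deviation is the appropriate divisor; some analyses use the sample standard deviation instead.

```python
import statistics

def z_scores(values):
    """Subtract the mean from each value and divide by the population
    standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical per-chapter measurements on very different scales.
words_per_sentence = [12, 15, 18, 21, 24]
metaphors_per_chapter = [2, 4, 6, 8, 10]
figurative_per_chapter = [8, 10, 12, 16, 20]

z_words = z_scores(words_per_sentence)
z_metaphors = z_scores(metaphors_per_chapter)

# Ratio of metaphor use to figurative language in general, per chapter.
metaphor_ratio = [
    m / f for m, f in zip(metaphors_per_chapter, figurative_per_chapter)
]

print([round(z, 2) for z in z_words])
print([round(z, 2) for z in z_metaphors])
print(metaphor_ratio)
```

Although the two raw measures differ in scale, their z-scores here are identical, which shows how normalization puts them on a common footing for comparison.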

**Descriptive Analysis of Text**

**Measures of Dispersion & Uniqueness**

*Standard deviation* – assesses the relative difference between texts on some measure (e.g., use of simile) by measuring how far, on average, each text deviates from the mean.

*Distinctiveness Ratio* – a distinctiveness ratio (DR) is calculated by dividing the frequency of a feature of interest in one text by its frequency in another. The DR is of interest when it is above 1.5 or below .67, which would indicate that two texts are fairly distinct.

*Yule's characteristic constant K* – based on the observation that most authors have a unique mixture of vocabulary or features, Yule’s *K* provides a measure of vocabulary diversity. Small values of Yule’s *K* indicate high diversity, whereas large values indicate low diversity. Unlike a distinctiveness ratio, Yule’s *K* is relatively independent of text length, so it can be used for texts of varying lengths.

*Rényi's Entropy* – A complexity-based measure related to Yule’s *K*. It provides a measure of ‘uncertainty’ (i.e., the level of unpredictability) between words or characters in a text.

*Delta* – a measure of textual differences based on word frequency. This measure is useful for identifying authorship when there is no proof or suggestion of any author’s claim to an anonymous work, and is based on the most frequent words.

*Zeta* – Zeta provides a measure of word diversity, and “zeta” words are typically ones with a middling frequency of occurrence in the language. The zeta measure is based on the notion that the frequency of any word is inversely proportional to its rank in a frequency table.

*Iota* – Like Delta and Zeta, Iota is based on word frequency. However, Iota focuses on relatively rare words (i.e., low-frequency words) for the purposes of identifying an author’s characteristic style.
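
Two of the measures above are simple enough to sketch directly. The example assumes whitespace tokenization, uses relative (per-token) frequencies for the distinctiveness ratio, and uses the common formulation of Yule's *K* as 10,000 × (Σ m²·V(m) − N) / N², where N is the number of tokens and V(m) is the number of distinct words that occur exactly m times. The two sample strings are invented.

```python
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lower-case and split on whitespace (an assumption;
    real studies usually handle punctuation and lemmatization too)."""
    return text.lower().split()

def distinctiveness_ratio(word, text_a, text_b):
    """Relative frequency of `word` in text_a divided by that in text_b.
    Raises ZeroDivisionError if the word never occurs in text_b."""
    a, b = tokenize(text_a), tokenize(text_b)
    return (a.count(word) / len(a)) / (b.count(word) / len(b))

def yules_k(text):
    """Yule's K: lower values indicate a more diverse vocabulary."""
    tokens = tokenize(text)
    n = len(tokens)
    # Map each frequency m to V(m), the number of words occurring m times.
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * v_m for m, v_m in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)

# Invented ten-word samples.
varied = "the cat sat on a mat while birds sang outside"
repetitive = "the cat and the cat and the cat and the"

print(yules_k(varied), yules_k(repetitive))
print(distinctiveness_ratio("the", repetitive, varied))
```

The repetitive sample yields a much larger *K* than the varied one, matching the description above, and the DR for "the" falls well above the 1.5 threshold.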

## Inferential Analysis of Text

**Inferential Statistics**

*Student’s t-test* – Normally distributed interval data (data where the distance between units is measurable) can be passed through a t-test to see if the difference between two independent (or dependent) groups is more than would be expected by chance. For example, a researcher might compare the mean use of relative clauses in one text to that in another.

*Chi-square test* – A chi-square test may be performed on categorical data (data that may be divided into groups). If a researcher is investigating whether there is a difference among e.e. cummings, Christopher Morley, and Isaac Rosenberg in standard and non-standard capitalization use, a chi-square test would be an appropriate comparison tool.
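
As a hedged illustration of the t-test comparison, the following sketch computes Student's t statistic for two independent samples, assuming equal variances. The per-chapter counts of relative clauses are invented, and the standard library does not supply the p-value, which would normally come from a statistics package or a t table.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Student's t statistic and degrees of freedom for two independent
    samples, assuming the two groups share a common variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 divisor)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t, n_a + n_b - 2

# Hypothetical relative clauses per 1,000 words in five chapters of each text.
text_a = [4, 5, 6, 7, 8]
text_b = [1, 2, 3, 4, 5]

t_stat, dof = pooled_t(text_a, text_b)
print(round(t_stat, 3), dof)
```

With these invented counts, t is about 3 on 8 degrees of freedom, which exceeds the two-tailed .05 critical value of 2.306.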

**Machine Learning Techniques**

*Sometimes, literary scholars will use a classifier, or machine learning technique, to group certain features of a text or texts together and arrive at a deeper understanding of the components that comprise a class. For example, a scholar might use a topic analysis technique over several pieces by Kenzaburo Oe in an attempt to formalize the characteristic themes in his novellas, essays, and short stories. The following machine learning techniques have been used for the purposes of literary analysis:*

- Cluster analysis (by author, by literary device, by function of literary device)
- Principal Components Analysis (PCA)
- Discriminant Analysis
- Topic Modeling
- Factor Analysis
- Neural networks
- Machine learning (typically Naïve Bayes or Support Vector Machines (SVM))
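
To give a sense of how one of the classifiers listed above works, here is a minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing. The training sentences and genre labels are invented, and the function names `train` and `classify` are illustrative only. Published studies typically rely on an established library and far more data.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior score."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training sentences with hypothetical genre labels.
TRAINING = [
    ("the detective found the body", "crime"),
    ("a murder suspect fled the scene", "crime"),
    ("her heart longed for his love", "romance"),
    ("they shared a kiss and love bloomed", "romance"),
]

model = train(TRAINING)
print(classify("the detective chased the suspect", model))
print(classify("a kiss for love", model))
```

The classifier treats each text as a bag of words, so word order and syntax are lost; that simplification is the "naïve" assumption the method is named for.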

**In the News**

Quantitative techniques reveal J.K. Rowling to be the author of a detective novel written under the pseudonym Robert Galbraith, in an analysis carried out by Patrick Juola: https://www.fastcompany.com/3016699/should-we-teach-literature-students-how-to-analyze-texts-algorithmically

**Resources & Tools for Text Analysis**

Resource for discovering tools to use for studying text - http://tapor.ca/home

Fun, free, and professional web-based text analyzer - https://voyant-tools.org/

**References**

Barros, L., Rodriguez, P., & Ortigosa, A. (2013). Automatic classification of literature pieces by emotion detection: a study on Quevedo's poetry. In *2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII)* (pp. 141-146). IEEE. URL: http://ieeexplore.ieee.org.proxy.library.emory.edu/stamp/stamp.jsp?arnumber=6681421

Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. *Literary and linguistic computing*, *17*(3), 267-287. URL: http://llc.oxfordjournals.org/content/17/3/267.full.pdf

Hoover, D.L. (2008). Using the Zeta and Iota Spreadsheet. Retrieved from https://wp.nyu.edu/exceltextanalysis/zetaiotawidespectrum/usingzetaiota/

Miranda-García, A. & Calle-Martín, J. (2005). Yule's characteristic K revisited. *Language Resources and Evaluation, 39*(4), 287-294. URL: https://www.jstor.org/stable/pdf/30204534.pdf

Sinclair, S. (2003). Computer-assisted reading: reconceiving text analysis. *Literary and Linguistic Computing, 18*(2), 175-184. URL: http://llc.oxfordjournals.org/content/18/2/175.full.pdf

Tanaka-Ishii, K., & Aihara, S. (2015). Computational Constancy Measures of Texts—Yule's K and Rényi's Entropy. *Computational Linguistics*. URL: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00228

Hoover, D.L. (2008). Quantitative Analysis and Literary Studies. In S. Schreibman and R. Siemens (Eds.), *A Companion to Digital Literary Studies*. Oxford: Blackwell. Retrieved from http://www.digitalhumanities.org/companionDLS/

Yu, B. (2008). An evaluation of text classification methods for literary study. *Literary and Linguistic Computing*, *23*(3), 327-343. URL: http://llc.oxfordjournals.org/content/23/3/327.full.pdf

Yule, G. U. (1944) *The Statistical Study of Literary Vocabulary*, Cambridge: Cambridge University Press.