Common Methods Across Disciplines

Curious about programming, statistical software, or quantitative methods used in different disciplines? This page gives a brief overview of different (common) methods and programs for those who may be interested in learning these techniques without prior experience.

The International Methods Colloquium (IMC) is a periodic on-line interactive seminar discussion on the application of quantitative statistical methodology to the social sciences. Its goal is to make the discussion of cutting edge techniques and interesting applications accessible to scholars, students, and data scientists all over the world -- free of charge.

Good Enough Practices in Scientific Computing: aimed at users who are new to scientific computing, and a good refresher on best practices for researchers.

Types of Data

Non-numeric or count data that can be divided into groups. For example, ethnic identity (Irish, Jewish, Scottish, Welsh, British, etc.) would be a categorical variable. Categorical data may be unordered (“nominal” data) or rank ordered (“ordinal” data).

Data that can be grouped and rank ordered. For example, if a researcher is interested in the order of acquisition for certain words, like “mama,” “dada,” and “me”, these items could be rank ordered by time of acquisition and studied as frequencies (e.g., how often “dada” was the first word vs. “mama”).

Discrete data are a type of numerical data that can only take certain values (typically whole-number counts) in a given range. The number of Black men represented in British dramas is an example of discrete data. It makes little sense to measure people in fractional increments, as 0.2 of an individual has no real-world significance.

A type of numerical data that can take any value within a given range. The amount of time spent engaged in work activities and income levels are examples of continuous data. Continuous data may have a true zero (“ratio” data) or no true zero (“interval” data), where a ‘true zero’ indicates the complete absence of the quantity being measured.

Narratives collected from interviews, stories or novels written in a particular time period, historical documents, newspaper articles, autobiographies, and any other written cultural artifact can be counted as data. Textual data is a subset of categorical data, but is important enough to the study of culture and ethnicity that it warrants a specific mention.

Spatial data is information that contains a geo-reference or geo-marker. In other words, spatial data is spread out over physical space, and often over time. For example, this type of data allows researchers to understand how political trends and views are associated with particular times and places.

Also called aggregate data, macrodata are data aggregated from levels “above” the individual. This distinction is primarily used in Political Science and Economics and encompasses categorical, ordinal, continuous, and discrete data types. For example, economic indicators such as inflation, unemployment, GDP, growth, population density, government spending and revenue, interest rates, and foreign aid capture the behavior of individuals, but go further and capture some information about an economic entity.

Information about individuals, including their beliefs, attitudes, and behaviors, obtained through carefully constructed surveys. While surveys exist in all forms of popular media, a good scientific survey will have carefully worded questions, reliability, and validity. Microdata can be categorical, continuous, or discrete data, and the distinction between individual-level and country-level data is primarily used in Political Science and Economics.

Descriptive Statistics

Measures of Central Tendency

When people say "mean" or "average," they are typically referring to the arithmetic mean. The arithmetic mean can be found by summing the observations and then dividing by the number of observations. The arithmetic mean is appropriate only for numerical (interval or ratio) data. A special case of the arithmetic mean is a weighted mean, which gives greater weight to certain observations over others. For example, if a researcher collects certainty ratings on a 1 to 7 scale, the average can be weighted so that answers associated with high certainty affect the mean more than uncertain answers. Researchers may also opt to compute a trimmed mean: a mean calculated after omitting the most extreme cases of a distribution. For monthly income data with 5 observations ($50, $1000, $2000, $1500, $15000), a trimmed mean would first eliminate the end points ($50 and $15000) and then average the remaining three values, giving $1500 rather than the untrimmed mean of $3910.
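As a rough illustration of the three kinds of mean described above, the sketch below uses Python's standard `statistics` module. The income figures come from the example in the text; the `answers` and `certainty` values, and the helper names `trimmed_mean` and `weighted_mean`, are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Monthly incomes from the example in the text.
incomes = [50, 1000, 2000, 1500, 15000]

def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

plain = mean(incomes)            # arithmetic mean of all five incomes
trimmed = trimmed_mean(incomes)  # mean of 1000, 1500, 2000

# Hypothetical answers, each paired with a 1-to-7 certainty rating
# that is used as its weight.
answers = [10, 20, 30]
certainty = [1, 4, 7]
weighted = weighted_mean(answers, certainty)

print(plain, trimmed, weighted)
```

Here the plain mean is 3910, the trimmed mean is 1500, and the weighted mean is 25.0, which sits closer to the high-certainty answer than the unweighted mean of 20 would.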

The median is the central value of a distribution, once every value is organized by rank from lowest to highest. The median can be used to determine the central tendency of ordinal data as well as continuous data (ratio & interval data), and is often a valuable measure to use when there are extreme values that are affecting the mean.

A measure used to describe the central tendency of categorical (nominal & ordinal) data. The mode refers to the category that occurs most frequently in a dataset. For example, if a researcher is interested in which ethnicity (Asian, Black, Hispanic) has the most positive view of police, they might count the number of individuals who express a “favorable” view and sort those individuals by ethnicity.
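The median and mode can be sketched in the same way. The incomes below are the five monthly incomes used earlier on this page; the list of respondents' ethnicities is hypothetical data invented for this example.

```python
from statistics import mean, median, mode

incomes = [50, 1000, 2000, 1500, 15000]
center_mean = mean(incomes)      # pulled upward by the 15000 outlier
center_median = median(incomes)  # middle value once the data are sorted

# Hypothetical ethnicities of respondents who expressed a "favorable" view.
favorable = ["Asian", "Black", "Hispanic", "Black", "Asian", "Black"]
most_common = mode(favorable)    # category that occurs most often

print(center_mean, center_median, most_common)
```

The median (1500) is unaffected by the single extreme income, while the mean (3910) is not, which is why the median is often preferred when extreme values are present.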

Measures of Dispersion

The range is the difference between the maximum and minimum values in a data set. This reflects the overall amount of variation in a data set. The interquartile range is calculated by subtracting the value of the first quartile from the value of the third quartile, so it describes the spread of the middle half of the data. These measures of dispersion should be used with the median rather than the mean and are useful only for continuous data.
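A minimal sketch of both measures, reusing the five monthly incomes from earlier on this page. The interquartile range is computed as the third quartile minus the first quartile. Quartile conventions differ between textbooks and software; the `method="inclusive"` setting below is one common choice and is an assumption of this example, not something the page prescribes.

```python
from statistics import quantiles

incomes = [50, 1000, 2000, 1500, 15000]

# Range: maximum minus minimum.
data_range = max(incomes) - min(incomes)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(incomes, n=4, method="inclusive")

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

With these data the range is 14950 while the interquartile range is only 1000, showing how strongly the range reacts to the two extreme incomes.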

The variance is the mean of the squared deviations between each score and the mean. The greater the variance, the more dispersion around the mean value.

The standard deviation, which is the square root of the variance, indicates variation around an arithmetic mean. Larger standard deviations indicate more variation around the mean, while smaller standard deviations indicate less variation around the mean. If there is wide variation (i.e., a high standard deviation), the arithmetic mean may not truly reflect the “average” of a dataset. Because the calculation for standard deviation requires comparison against the mean, it should not be used when the median is the measure of central tendency. In Linguistics, the standard deviation is often called root mean square.
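The relationship between variance and standard deviation can be checked with a few lines of Python. The `scores` below are hypothetical. Note that software usually offers two versions: the population form (dividing by n, which matches the "mean of squared deviations" definition) and the sample form (dividing by n - 1).

```python
from statistics import mean, pstdev, pvariance, stdev, variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
center = mean(scores)

# Variance "by hand": the mean of squared deviations from the mean.
by_hand = sum((s - center) ** 2 for s in scores) / len(scores)

pop_var = pvariance(scores)  # population variance (divide by n)
pop_sd = pstdev(scores)      # its square root
samp_var = variance(scores)  # sample variance (divide by n - 1)
samp_sd = stdev(scores)      # its square root

print(by_hand, pop_var, pop_sd, samp_var, samp_sd)
```

For these scores the mean is 5, the population variance is 4, and the population standard deviation is 2; the sample versions are slightly larger because they divide by 7 instead of 8.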

The standard error is the standard deviation of a sampling distribution. That is, it measures how much a sample statistic (such as the sample mean) would vary from sample to sample, and therefore how precisely it estimates the population parameter. Often, this is more useful than a standard deviation because most researchers care about generalizing beyond a sample to a population.
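For the most common case, the standard error of the mean, the usual estimate is the sample standard deviation divided by the square root of the sample size. The sketch below applies that formula to hypothetical scores.

```python
from math import sqrt
from statistics import mean, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

n = len(scores)
sample_mean = mean(scores)
sample_sd = stdev(scores)            # sample standard deviation (n - 1)
standard_error = sample_sd / sqrt(n)  # standard error of the mean

print(sample_mean, round(standard_error, 3))
```

Because the sample size appears in the denominator, collecting more observations shrinks the standard error even when the standard deviation of the data stays the same.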

Inferential Statistics

Appropriate for testing the statistical significance of the association between two categorical variables. A chi-square test compares the observed frequencies of a variable to its expected frequencies. If a researcher wanted to know if there was an association between a child’s gender (male, female) and the brand names of toys (e.g., Teenage Mutant Ninja Turtles, Barbie, and Legos) children request in letters to Santa Claus, the researcher would compare the actual numbers of requests by gender with the expected values (e.g., an even 33-33-33 split, which would be “chance level” for 3 options) to uncover a relationship.
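A minimal sketch of the chi-square statistic for a gender-by-toy table follows. The counts are hypothetical. This version uses the standard test-of-independence form, in which each expected count is (row total x column total) / grand total, rather than a fixed even split; with equal-sized groups and equally popular toys the two would coincide. The value 5.991 is the familiar 5% critical value for 2 degrees of freedom.

```python
# Hypothetical counts of toy requests: columns are
# Teenage Mutant Ninja Turtles, Barbie, Legos.
observed = {
    "boys": [30, 5, 15],
    "girls": [10, 30, 10],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over every cell.
chi_square = 0.0
for i, row in enumerate(rows):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(rows) - 1) * (len(col_totals) - 1)

critical_05 = 5.991  # 5% critical value for 2 degrees of freedom
significant = chi_square > critical_05

print(round(chi_square, 2), dof, significant)
```

In practice a statistics package would also report an exact p-value; the comparison against a tabled critical value is used here only to keep the example dependency-free.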

A confidence interval is a range of values that is likely to include the real population value. Confidence intervals can be used to test hypotheses (e.g., true average household income is $30K per annum) by estimating an interval. If the interval contains the hypothesized value (e.g., $30K), then the hypothesized value cannot be dismissed out of hand. If the confidence interval excludes the hypothesized value, then the hypothesized value is unlikely to be the true population value.
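The household-income check described above can be sketched as follows. The incomes (in $K) are hypothetical, and the interval uses the normal approximation (mean plus or minus about 1.96 standard errors); for a sample this small a t-based critical value would give a somewhat wider interval, so treat this as a simplification.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical annual household incomes, in thousands of dollars.
incomes_k = [28, 31, 35, 27, 30, 33, 29, 32, 34, 26]

n = len(incomes_k)
estimate = mean(incomes_k)
std_error = stdev(incomes_k) / n ** 0.5

# Critical value for a 95% interval under the normal approximation.
z = NormalDist().inv_cdf(0.975)

low = estimate - z * std_error
high = estimate + z * std_error

hypothesized = 30  # the $30K hypothesis from the text
inside = low <= hypothesized <= high

print(round(low, 2), round(high, 2), inside)
```

Here the interval runs from roughly 28.6 to 32.4, so the hypothesized $30K falls inside it and cannot be ruled out on this evidence.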

To assess whether the means of two populations differ, a t-test may be appropriate. The test statistic follows a t-distribution rather than a normal distribution, which makes the test best suited to small samples (fewer than 30), or to cases where the population standard deviation is not known and must be estimated from the sample. If a researcher wanted to know whether Asian American college students of two genders (e.g., male and female) differ in the amount of cultural values conflict they feel, a t-test may be appropriate. Comparing three or more groups calls for a technique such as ANOVA.
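As one possible illustration of a two-group comparison, the sketch below computes Welch's t statistic (a common variant of the t-test that does not assume equal variances) and its approximate degrees of freedom. The two groups of conflict scores are hypothetical, and turning the statistic into a p-value would require a t-distribution table or a statistics package.

```python
from statistics import mean, variance

# Hypothetical cultural values conflict scores for two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [2, 3, 4, 5, 6]

n_a, n_b = len(group_a), len(group_b)
# Squared standard error contributed by each group.
se2_a = variance(group_a) / n_a
se2_b = variance(group_b) / n_b

# Welch's t: difference in means over its standard error.
t_stat = (mean(group_a) - mean(group_b)) / (se2_a + se2_b) ** 0.5

# Welch-Satterthwaite approximation to the degrees of freedom.
dof = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
)

print(round(t_stat, 3), round(dof, 1))
```

With these data the statistic is 3.0 on 8 degrees of freedom; when both groups have the same size and variance, as here, Welch's version agrees with the classic equal-variance t-test.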

A test based on the ‘normal’ (or Gaussian) distribution. It is appropriate for tests of a single population or when the standard deviation of the population is known (or can be estimated from a sample of 30 or more observations). A z-test can also be used on gamma, which is a measure of how predictive one ordinal variable is of another.
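A one-sample z-test can be sketched in a few lines. All of the numbers below (sample mean, hypothesized mean, known population standard deviation, and sample size) are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 52.0     # hypothetical observed mean
hypothesized = 50.0    # hypothetical population mean under the null
population_sd = 10.0   # assumed known
n = 100                # sample size

# z: how many standard errors the sample mean lies from the hypothesis.
z = (sample_mean - hypothesized) / (population_sd / sqrt(n))

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, round(p_value, 4))
```

Here z is 2.0 and the two-sided p-value is about 0.046, just under the conventional 0.05 threshold.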

In a bivariate case, ordinary least squares (OLS) regression measures a linear relationship between one outcome measure (dependent variable) and one indicator (independent variable). For example, the relationship between environmental conflict and water scarcity could be analyzed with an OLS regression model. It is possible to extend this model to a multivariate case, where several indicators may be used. Unlike some other statistical techniques, OLS regression can be used when indicator variables are of differing types (i.e., continuous AND categorical variables), which makes it suitable for a wide variety of research questions.
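In the bivariate case the OLS estimates have a simple closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of the indicator, and the intercept follows from the two means. The sketch below applies this to hypothetical water scarcity and conflict values; the variable names and data are invented for illustration.

```python
from statistics import mean

# Hypothetical data: a water scarcity index (x) and a conflict measure (y).
scarcity = [1, 2, 3, 4, 5]
conflict = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar, y_bar = mean(scarcity), mean(conflict)

# Sum of cross-products and sum of squares of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(scarcity, conflict))
s_xx = sum((x - x_bar) ** 2 for x in scarcity)

slope = s_xy / s_xx                # change in y per one-unit change in x
intercept = y_bar - slope * x_bar  # predicted y when x is zero

def predict(x):
    """Fitted value from the estimated regression line."""
    return intercept + slope * x

print(round(slope, 2), round(intercept, 2), round(predict(6), 2))
```

For these data the fitted line is approximately y = 0.09 + 1.97x. Multivariate models are normally fitted with a statistics package rather than by hand.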

This analytic technique is a special case of the general linear model (regression). The independent variables are categorical and the dependent measures are continuous. For example, if researchers are interested in whether negative attitudes toward Pit Bull terriers (a breed of dog sometimes portrayed as uncommonly violent) are associated with a person’s attitudes toward what Pit Bulls represent (e.g., violent crime, drugs), an ANOVA would be appropriate if attitudes toward Pit Bulls were measured on a continuous scale and respondents were grouped into categories by what they believe Pit Bulls represent.
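A one-way ANOVA compares the variation between group means with the variation inside the groups, summarized as an F statistic. The sketch below computes F by hand for three hypothetical groups of attitude scores; the group labels and values are invented, and a p-value would come from an F-distribution table or a statistics package.

```python
from statistics import mean

# Hypothetical attitude scores for three categories of respondents.
groups = {
    "group_1": [1, 2, 3],
    "group_2": [2, 3, 4],
    "group_3": [5, 6, 7],
}

all_scores = [s for scores in groups.values() for s in scores]
grand_mean = mean(all_scores)

# Between-group sum of squares: group means versus the grand mean.
ss_between = sum(
    len(scores) * (mean(scores) - grand_mean) ** 2
    for scores in groups.values()
)
# Within-group sum of squares: scores versus their own group mean.
ss_within = sum(
    (s - mean(scores)) ** 2
    for scores in groups.values()
    for s in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

print(round(f_stat, 2), df_between, df_within)
```

Here F is 13 on 2 and 6 degrees of freedom, reflecting group means that differ far more than the scores within each group do.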

Clustering & Pattern Discovery Techniques

Content analysis is a way to understand what is written or depicted by counting phenomena in text (e.g., the number of stories, the number of images, the number of unique nouns, etc.). For example, a researcher could use content analysis to assess the language surrounding sports coverage in various media outlets.
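At its simplest, counting phenomena in text means tokenizing it and tallying the tokens. The sketch below counts word frequencies in a short, made-up snippet of sports coverage; real content analysis usually involves a carefully defined coding scheme rather than raw word counts.

```python
import re
from collections import Counter

# Hypothetical snippet of sports coverage.
text = "The team won the match. The coach praised the team."

# Lowercase the text and pull out runs of letters as words.
words = re.findall(r"[a-z']+", text.lower())

counts = Counter(words)
unique_words = len(counts)
top_two = counts.most_common(2)

print(top_two, unique_words)
```

The same pattern extends to counting categories of interest (for example, words from a predefined list of evaluative terms) across many articles or outlets.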

A text analysis method that, in an unsupervised way, discovers the semantic structures that underlie content. Topic models assume that within each ‘document’ there are many topics, and that each word has a probability of being sorted into a topic. A scholar might use this technique to uncover the implicit content categories in a narrative or interview as a window into the cultural milieu.

Used in: English, Linguistics

The measurements we use are often imperfect. The question behind Principal Components Analysis (PCA) is this: how many dimensions are there to the data? For example, a language development researcher might want to understand how phonemic awareness predicts reading ability. However, there may be four different scales of reading ability available, and, not knowing which is the best, the researcher uses all four. PCA uses correlations among measures to arrive at a reduced set of dimensions (‘principal components’) in the data. In our language development example, if the reading ability scales measure the same thing, they may all be part of the same dimension.
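To make the idea concrete, the sketch below builds the correlation matrix for three hypothetical reading scales and finds its first principal component using power iteration, a simple numerical method chosen here to avoid external libraries (statistical packages typically use a full eigendecomposition instead). The scale names, scores, and helper functions are all invented for illustration.

```python
from statistics import mean

# Hypothetical scores of six children on three reading ability scales.
scales = {
    "scale_a": [10, 12, 14, 16, 18, 20],
    "scale_b": [11, 13, 13, 17, 19, 21],
    "scale_c": [9, 12, 15, 15, 18, 22],
}

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

names = list(scales)
corr = [[pearson(scales[a], scales[b]) for b in names] for a in names]

def multiply(matrix, vec):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def first_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector via power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        product = multiply(matrix, vec)
        norm = sum(p * p for p in product) ** 0.5
        vec = [p / norm for p in product]
    eigenvalue = sum(v * p for v, p in zip(vec, multiply(matrix, vec)))
    return eigenvalue, vec

eigenvalue, loadings = first_component(corr)

# Share of total variance captured by the first component.
share = eigenvalue / len(names)

print(round(share, 3), [round(w, 3) for w in loadings])
```

If the first component accounts for most of the variance and every scale loads on it with the same sign, as happens with these strongly correlated made-up scores, that is consistent with the scales measuring a single underlying dimension.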