I’m working on a paper on analysing YouTube videos, so I’ve been looking into methodologies that could apply. Some time ago, I found a book by Mike Thelwall that explains Word Association Thematic Analysis (WATA) as a way to detect differences in word usage between two groups of texts. Today I read the article “Word Association Thematic Analysis: Insight Discovery from the Social Web” because I thought it could be helpful. Here are some insights from it =)
Thelwall, M. (2023). Word Association Thematic Analysis: Insight Discovery from the Social Web. SN Computer Science, 4(6), 827.
https://doi.org/10.1007/s42979-023-02289-9
What is WATA?
If you work with social media data or any kind of large text collection, you know how tricky it can be to make sense of thousands or even millions of comments. Traditional thematic analysis is great for depth, but it is time-consuming and does not scale easily. Topic modeling, on the other hand, scales well, but its output can be difficult to interpret or put into context.
WATA was created to fill that gap. It is a mixed-method approach that combines statistics with human interpretation to find and explain differences between two groups of texts (e.g., men vs women, before vs after, topic A vs topic B).
Its key insight: rather than clustering all texts into topics (as topic modeling does), it focuses first on differences between two predefined groups, then interprets those differences thematically.
By doing so, WATA capitalizes on the strengths of both quantitative and qualitative methods:
- Quantitative: identifies words that are statistically overrepresented in one group vs. another
- Qualitative: examines how those words are used in context, and groups them into coherent themes
- Interpretive: generates a narrative of how the two groups diverge in discourse
This makes it especially useful when your primary research interest is the contrast between two groups rather than the description of a single corpus in isolation.
How does it work?
- Collecting the Texts: text data can be gathered from social media platforms, YouTube comments, or other sources (such as abstracts from papers). The paper uses Mozdeh as the tool for extraction, but any suitable tool or corpus can be used. One key point is that this method requires a large dataset (thousands, ideally tens of thousands, of texts) to produce statistically significant results.
- Splitting into Groups: the dataset needs to be divided into two groups (A and B) according to the comparison of interest (e.g., male vs female author, pre- vs post-event, positive vs negative sentiment). Texts that do not clearly belong to either group are excluded (remainder group).
- WAD (Word Association Detection): this step identifies words that are significantly more frequent in one set than in the other, prioritising those with the largest percentage difference between the sets. Mozdeh’s Mine Associations button first applies a chi-squared test to rank the words by statistical significance, then applies a Benjamini-Hochberg correction to account for the many words being tested at once.
- WAC (Word Association Contextualisation): this is the qualitative work. For each statistically significant word, a researcher must read a sample of texts (~10–40) to understand the context in which the word is used. This helps the analyst move beyond surface-level word frequency to contextual meaning (e.g., bank as financial institution vs. riverbank).
- TA (Thematic Analysis): finally, the contexts are grouped into broader themes for each group. This step is ideally done by several researchers for reliability. The result is a list of meaningful, evidence-based themes that explain how the groups differ.
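To make the WAD and WAC steps more concrete, here is a minimal Python sketch of how I picture them. It is not Mozdeh’s implementation: the function names, the naive tokeniser, the choice to count each word once per text, and the 0.05 threshold are all my own assumptions for illustration. It runs a chi-squared test on a 2x2 table per word, applies the Benjamini-Hochberg procedure, and then draws a random sample of texts to read for a given word.

```python
import math
import random
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumption: very naive tokeniser


def tokens(text):
    """Set of lowercase words in one text."""
    return set(TOKEN.findall(text.lower()))


def text_counts(texts):
    """Number of texts containing each word (a word counts once per text)."""
    counts = Counter()
    for text in texts:
        counts.update(tokens(text))
    return counts


def chi2_p_value(a, b, c, d):
    """Pearson chi-squared p-value (1 degree of freedom) for a 2x2 table.

    a, b = group A texts with / without the word; c, d = same for group B.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # word is in every text or in none: no evidence either way
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the p-values kept by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes its threshold
    return set(order[:cutoff])


def detect_word_associations(group_a, group_b, alpha=0.05):
    """WAD sketch: (word, percentage-point difference A - B, p-value) tuples."""
    count_a, count_b = text_counts(group_a), text_counts(group_b)
    n_a, n_b = len(group_a), len(group_b)
    words = sorted(set(count_a) | set(count_b))
    p_values = [
        chi2_p_value(count_a[w], n_a - count_a[w], count_b[w], n_b - count_b[w])
        for w in words
    ]
    results = []
    for i in benjamini_hochberg(p_values, alpha):
        w = words[i]
        diff = 100 * (count_a[w] / n_a - count_b[w] / n_b)
        results.append((w, diff, p_values[i]))
    # Largest absolute difference first, as a rough reading order.
    return sorted(results, key=lambda r: -abs(r[1]))


def sample_contexts(texts, word, k=20, seed=0):
    """WAC sketch: random sample of up to k texts containing the word."""
    matches = [t for t in texts if word in tokens(t)]
    return random.Random(seed).sample(matches, min(k, len(matches)))
```

With two lists of strings, `detect_word_associations(group_a, group_b)` gives the candidate words, and `sample_contexts(group_a, "some_word")` gives the texts a researcher would then read by hand. The thematic analysis itself stays a human task, so there is no code for it.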
Examples from the Article:
- Bullying on YouTube: from 4.6 million comments, the method found 1,000 bullying-related words (WAD stage) and 12 themes (WAC & TA stages) related to bullying experiences and support for victims. One new insight: support focused on separating the victim’s identity from the abuse (“it’s not your fault”).
- “My ADHD” on Twitter: after analysing 59,000 tweets mentioning “my ADHD” vs. 99 other conditions, 19 themes were found (e.g., medication, diagnosis). The new insight was how people talked about their brain as a separate entity from themselves.
- Gender in US academic papers: comparing male vs. female authors revealed big differences in topics like “Mothers” (15x more common among women) and “Engine components” (16x more common among men).
Comparison with other methods:
| Method | Strengths | Weaknesses |
|---|---|---|
| Content Analysis | Structured, clear category counts | Labor-intensive coding |
| Thematic Analysis | Deep qualitative insight | Limited scalability |
| Topic Modeling | Fast, automatic, good for topic discovery | Limited interpretability, not designed for comparisons |
| WATA | Combines scale with human meaning; focuses on differences | Requires large datasets; manual interpretation remains essential |
Conclusion
If I have understood everything correctly (please read the paper to confirm and ease my mind), WATA provides a structured, hybrid framework that can help uncover and interpret thematic differences between two sets of texts. It combines statistical analysis with qualitative insight, and produces results that are both scalable and meaningful. The most important point is that it’s not just about counting words; it’s about understanding what those words mean in their real context and how two groups of people talk differently.
See you in the next paper =)