class: center middle main-title section-title-4

# Text

.class-info[

**Session 14**

.light[PMAP 8921: Data Visualization with R<br>
Andrew Young School of Policy Studies<br>
Summer 2025]

]

---
name: outline
class: title title-inv-7

# Plan for today

--

.box-6.medium.sp-after[Qualitative text-based data]

--

.box-5.medium.sp-after[Crash course in<br>computational linguistics]

---
layout: false
name: text-data
class: center middle section-title section-title-6 animated fadeIn

# Qualitative text-based data

---
layout: true
class: title title-6

---

# Free responses

.center[
<figure>
  <img src="img/14/free-responses.png" alt="Free responses from a survey" title="Free responses from a survey" width="70%">
  <figcaption>Typical free responses from a survey</figcaption>
</figure>
]

---

# y tho?

.center[
<figure>
  <img src="img/14/word-cloud.png" alt="Bad word cloud" title="Bad word cloud" width="60%">
</figure>
]

---

# Some cases are okay

.center[
<figure>
  <img src="img/14/email-word-cloud.jpg" alt="What Happened? word cloud" title="What Happened? word cloud" width="45%">
</figure>
]

???

https://twitter.com/s_soroka/status/907941270735278085

---

# Word clouds for grownups

.box-inv-6[Count words, but in fancier ways]

.center[
<figure>
  <img src="img/14/cover.png" alt="Tidy text mining with R" title="Tidy text mining with R" width="30%">
</figure>
]

???

https://www.tidytextmining.com/

---
layout: false
class: bg-full
background-image: url("img/14/he-she-julia.png")

???

https://pudding.cool/2017/08/screen-direction/

---
layout: false
class: bg-90
background-image: url("img/14/minimap-1.png")

???
https://juliasilge.com/blog/song-lyrics-across/

---
layout: false
name: computational-linguistics
class: center middle section-title section-title-5 animated fadeIn

# Crash course in<br>computational linguistics

---
layout: true
class: title title-5

---

# Core concepts and techniques

--

.box-inv-5[Tokens, lemmas, and parts of speech]

--

.box-inv-5[Sentiment analysis]

--

.box-inv-5[tf-idf]

--

.box-inv-5[Topics and LDA]

--

.box-inv-5[Fingerprinting]

---

# Regular text

.small-code[
```
THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank you very much. They were
the last people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense. Mr. Dursley was the director
of a firm called Grunnings, which made drills. He was a big, beefy man with
hardly any neck, although he did have a very large mustache. Mrs. Dursley was
thin and blonde and had nearly twice the usual amount of neck, which came in
very useful as she spent so much of her time craning over garden fences, spying
on the neighbors. The Dursleys had a small son called Dudley and in their
opinion there was no finer boy anywhere. The Dursleys had everything they
wanted, but they also had a secret, and their greatest fear was that somebody
would discover it. They didn't think they could bear it if anyone found out
about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met
for several years; in fact, Mrs. Dursley pretended she didn't have a sister,
because her sister and her good-for-nothing husband were as unDursleyish as it
was possible to be. The Dursleys shuddered to think what the neighbors would
say if the Potters a...
```
]

---

# Tidy text

.box-inv-5[One row for each text element]

.box-5.small[Can be chapter, page, verse, etc.]
.small-code[
```
# A tibble: 6 × 3
  chapter book                                     text
    <int> <chr>                                    <chr>
1       1 Harry Potter and the Philosopher's Stone "THE BOY WHO LIVED Mr. and Mrs. Dursl…
2       2 Harry Potter and the Philosopher's Stone "THE VANISHING GLASS Nearly ten years…
3       3 Harry Potter and the Philosopher's Stone "THE LETTERS FROM NO ONE The escape o…
4       4 Harry Potter and the Philosopher's Stone "THE KEEPER OF THE KEYS BOOM. They kn…
5       5 Harry Potter and the Philosopher's Stone "DIAGON ALLEY Harry woke early the ne…
6       6 Harry Potter and the Philosopher's Stone "THE JOURNEY FROM PLATFORM NINE AND TH…
```
]

---

# Tokens

.box-inv-5[Split the text into even smaller parts]

.box-5.small[Paragraph, line, verse, sentence, n-gram, word, letter, etc.]

.pull-left.small-code[
```
# A tibble: 6 × 3
  word  chapter book
  <chr>   <int> <chr>
1 the         1 Harry Potter...
2 boy         1 Harry Potter...
3 who         1 Harry Potter...
4 lived       1 Harry Potter...
5 mr          1 Harry Potter...
6 and         1 Harry Potter...
```
]

--

.pull-right.small-code[
```
# A tibble: 6 × 3
  bigram    chapter book
  <chr>       <int> <chr>
1 the boy         1 Harry Potter...
2 boy who         1 Harry Potter...
3 who lived       1 Harry Potter...
4 lived mr        1 Harry Potter...
5 mr and          1 Harry Potter...
6 and mrs         1 Harry Potter...
```
]

---

# Stop words

.box-inv-5[Common words that we can generally ignore]

.center.small-code[
```
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>
 1 a           SMART
 2 a's         SMART
 3 able        SMART
 4 about       SMART
 5 above       SMART
 6 according   SMART
 7 accordingly SMART
 8 across      SMART
 9 actually    SMART
10 after       SMART
# ℹ 1,139 more rows
```
]

---

# Token frequency: words

<img src="14-slides_files/figure-html/hp-words-1.png" width="100%" style="display: block; margin: auto;" />

---

# Token frequency: n-grams

<img src="14-slides_files/figure-html/hp-bigrams-1.png" width="100%" style="display: block; margin: auto;" />

---

# Token frequency: n-gram ratios

<img src="14-slides_files/figure-html/hp-se-she-1.png" width="100%" style="display: block; margin: auto;" />

---

# Parts of speech

.small-code[
```
# A tibble: 50 × 11
   doc_id   sid   tid token   token_with_ws lemma   upos  xpos  feats  tid_source relation
    <dbl> <dbl> <dbl> <chr>   <chr>         <chr>   <chr> <chr> <chr>       <dbl> <chr>
 1      1     1     1 THE     THE           the     DET   DT    Defin…          2 det
 2      1     1     2 BOY     BOY           Boy     NOUN  NN    Numbe…         18 nsubj
 3      1     1     3 WHO     WHO           who     PRON  WP    PronT…          4 nsubj
 4      1     1     4 LIVED   LIVED         live    VERB  VBD   Mood=…          2 acl:rel…
 5      1     1     5 Mr.     Mr.           Mr.     PROPN NNP   Numbe…          4 xcomp
 6      1     1     6 and     and           and     CCONJ CC    NA              7 cc
 7      1     1     7 Mrs.    Mrs.          Mrs.    PROPN NNP   Numbe…          5 conj
 8      1     1     8 Dursley Dursley       Dursley PROPN NNP   Numbe…          7 flat
 9      1     1     9 ,       ,             ,       PUNCT ,     NA              5 punct
10      1     1    10 of      of            of      ADP   IN    NA             11 case
# ℹ 40 more rows
```
]

.box-inv-5.small[These use the [Penn part of speech tags](https://cs.nyu.edu/grishman/jet/guide/PennPOS.html)]

---

# Parts of speech frequency

.pull-left-3.small-code[
.box-inv-5.small[Verbs]

```
# A tibble: 1,557 × 2
   lemma     n
   <chr> <dbl>
 1 say     920
 2 get     440
 3 have    417
 4 go      384
 5 look    380
 6 be      310
 7 know    310
 8 see     303
 9 think   230
10 do      227
# ℹ 1,547 more rows
```
]

--

.pull-middle-3.small-code[
.box-inv-5.small[Nouns]

```
# A tibble: 2,852 × 2
   lemma          n
   <chr>      <dbl>
 1 Harry       1315
 2 Ron          423
 3 Hagrid       258
 4 Professor    167
 5 Snape        154
 6 Hermione     153
 7 Dumbledore   144
 8 time         138
 9 Dudley       136
10 uncle        122
# ℹ 2,842 more rows
```
]

--

.pull-right-3.small-code[
.box-inv-5.small[Adjectives & adverbs]

```
# A tibble: 1,240 × 2
   lemma     n
   <chr> <dbl>
 1 back    223
 2 so      215
 3 just    180
 4 when    178
 5 very    171
 6 now     166
 7 then    165
 8 all     147
 9 how     136
10 there   123
# ℹ 1,230 more rows
```
]

---

# Artsy stuff

.center[
<figure>
  <img src="img/14/closeup.jpg" alt="Alice in Wonderland punctuation by Nicholas Rougeux" title="Alice in Wonderland punctuation by Nicholas Rougeux" width="100%">
</figure>
]

???
[*Between the Words*](https://www.c82.net/work/?id=347) by Nicholas Rougeux

---

# Sentiment analysis

.pull-left-3.small-code[
``` r
get_sentiments("bing")
```

```
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# ℹ 6,776 more rows
```
]

--

.pull-middle-3.small-code[
``` r
get_sentiments("afinn")
```

```
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows
```
]

--

.pull-right-3.small-code[
``` r
get_sentiments("nrc")
```

```
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
```
]

---

<img src="14-slides_files/figure-html/hp-net-sentiment-1.png" width="100%" style="display: block; margin: auto;" />

---

# tf-idf

.box-inv-5[Term frequency-inverse document frequency]

.box-5.small[How important a
term is compared to the rest of the documents]

$$
`\begin{aligned}
tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\
idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\
tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term})
\end{aligned}`
$$

---

# tf-idf

<img src="14-slides_files/figure-html/hp-tf-idf-1.png" width="100%" style="display: block; margin: auto;" />

---

# Topic modeling

.pull-left.center[
<figure>
  <img src="img/14/laurel-thatcher-ulrich.jpg" alt="Laurel Thatcher Ulrich" title="Laurel Thatcher Ulrich" width="55%">
</figure>
]

.pull-right.center[
<figure>
  <img src="img/14/midwifes-tale.jpg" alt="A Midwife's Tale" title="A Midwife's Tale" width="55%">
</figure>
]

???

https://commons.wikimedia.org/wiki/File:Laurel_Thatcher_Ulrich_(32803708014).jpg

https://commons.wikimedia.org/wiki/File:A_Midwife%27s_Tale_by_Laurel_Thatcher_Ulrich.jpg

---

# Latent Dirichlet Allocation (LDA)

.center[
<figure>
  <img src="img/14/LDA.png" alt="Latent Dirichlet Allocation" title="Latent Dirichlet Allocation" width="80%">
</figure>
]

---

# Clusters of related words

<table>
  <tr>
    <th class="cell-left">Topic label</th>
    <th class="cell-left">Topic words</th>
  </tr>
  <tr>
    <td class="cell-left">Midwifery</td>
    <td class="cell-left">birth safe morn receivd calld left cleverly pm labour …</td>
  </tr>
  <tr>
    <td class="cell-left">Church</td>
    <td class="cell-left">meeting attended afternoon reverend worship …</td>
  </tr>
  <tr>
    <td class="cell-left">Death</td>
    <td class="cell-left">day yesterday informd morn years death expired …</td>
  </tr>
  <tr>
    <td class="cell-left">Gardening</td>
    <td class="cell-left">gardin sett worked clear beens corn warm planted …</td>
  </tr>
  <tr>
    <td class="cell-left">Shopping</td>
    <td class="cell-left">lb made brot bot tea butter sugar carried …</td>
  </tr>
  <tr>
    <td class="cell-left">Illness</td>
    <td class="cell-left">unwell sick gave dr rainy easier care head neighbor …</td>
  </tr>
</table>

???
https://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/

http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/

---

# Track topics over time

.pull-left.center[
<figure>
  <img src="img/14/coldweatherbymonth.png" alt="Cold weather by month" title="Cold weather by month" width="100%">
  <figcaption>Cold weather topic by month</figcaption>
</figure>
]

--

.pull-right.center[
<figure>
  <img src="img/14/emotionbyyear.png" alt="Emotion by year" title="Emotion by year" width="100%">
  <figcaption>Emotion topic over time</figcaption>
</figure>
]

???

https://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/

---

# State of the Union addresses

.center[
<figure>
  <img src="img/14/sotu.png" alt="State of the Union topics over time" title="State of the Union topics over time" width="37%">
</figure>
]

???

https://cran.r-project.org/web/packages/cleanNLP/vignettes/state-of-union.html

---

# Fingerprinting

.box-inv-5[Analyze richness or uniqueness of a document]

.box-5[Punctuation patterns, vocabulary choices, sentence length]

.box-5[Hapax legomenon]

---

# Sentence length

.center[
<figure>
  <img src="img/14/fingerprint-sentence.png" alt="Sentence length heatmaps" title="Sentence length heatmaps" width="75%">
</figure>
]

???

https://kops.uni-konstanz.de/bitstream/handle/123456789/5492/Literature_Fingerprinting.pdf

---

# Hapax legomena

.center[
<figure>
  <img src="img/14/fingerprint-hapax.png" alt="Hapax heatmaps" title="Hapax heatmaps" width="75%">
</figure>
]

---

# Verse length

.center[
<figure>
  <img src="img/14/fingerprint-verse.png" alt="Verse length heatmaps" title="Verse length heatmaps" width="40%">
</figure>
]
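
---

# tf-idf by hand

.box-inv-5.small[A quick sketch of the tf-idf formulas from earlier, using made-up counts (assumed for illustration only, not taken from the Harry Potter books)]

.box-5.small[Suppose a term appears 5 times in a 100-term document and shows up in 2 of 7 documents]

$$
`\begin{aligned}
tf(\text{term}) &= \frac{5}{100} = 0.05 \\
idf(\text{term}) &= \ln{\left(\frac{7}{2}\right)} \approx 1.253 \\
tf\text{-}idf(\text{term}) &= 0.05 \times 1.253 \approx 0.063
\end{aligned}`
$$

.box-5.small[A term found in all 7 documents has idf = ln(7/7) = 0, so its tf-idf is 0 no matter how often it appears]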