Juan's Topic Modeling

Page history last edited by Juan Llamas-Rodriguez 10 years, 5 months ago

 

For my exercise I wanted to see whether I could topic model a film. For the modeling to work, the film first has to be turned into a text file. I considered using either the screenplay or the closed-caption subtitles as the input document. I opted for the closed captions because they are a kind of transcript of the film's dialogue and basic actions, though I am open to discussing whether this was the better choice (and I may end up doing a comparative topic modeling later to see what kind of results come up).

 

The film I chose was Children of Men, in part to justify my use of the closed captions: it is a film that features a lot of expository dialogue (even if much of it comes from spoken advertisements).
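
As a rough sketch of the "film to text file" step, here is one way caption text could be pulled out of a subtitle file and grouped into documents. It assumes the captions are in SRT format and that consecutive captions are chunked into fixed-size documents; neither detail is something I have stated above, and the names `srt_to_lines` and `chunk` and the sample text are made up for illustration.

```python
import re

# Matches SRT timing lines such as "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
TAG = re.compile(r"<[^>]+>")  # inline markup such as <i>...</i>


def srt_to_lines(srt_text):
    """Return the caption text of an SRT file, one string per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop the leading cue counter, then the timing line; keep the text.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        text = [ln for ln in lines if not TIMING.search(ln)]
        if text:
            cues.append(TAG.sub("", " ".join(text)))
    return cues


def chunk(cues, size):
    """Group consecutive cues into documents of `size` cues each (an assumption)."""
    return [" ".join(cues[i:i + size]) for i in range(0, len(cues), size)]


# Hypothetical sample, not taken from the film's actual caption file.
sample = """1
00:00:01,000 --> 00:00:03,000
<i>First caption line.</i>

2
00:00:03,500 --> 00:00:05,000
Second caption,
split over two lines.

3
00:00:06,000 --> 00:00:07,000
Third caption.
"""

cues = srt_to_lines(sample)
documents = chunk(cues, 2)
print(documents)
```

How the captions are split into documents matters, since the model groups words by the documents they share; the chunk size here is arbitrary.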

 

One point the readings made is that choosing the number of topics to extract from a corpus is more of an art than a science, so I decided to play with this variable and run several tries for each number of topics. Here are some results:
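
To make the procedure concrete, here is a minimal sketch of "several tries for each number of topics". It assumes an LDA-style model fitted by a toy collapsed Gibbs sampler, which stands in for whatever tool actually produced the tables below; the function names, the hyperparameters, and the four-document corpus are all hypothetical.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns (doc_topic counts, topic_word counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Give every token a random starting topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """The n most frequent words currently assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical toy corpus, NOT the film's captions.
docs = [
    "baby girl doctor baby pregnant".split(),
    "police papers illegal immigrants police".split(),
    "baby doctor girl pregnant baby".split(),
    "immigrants police papers transit illegal".split(),
]

results = {}
for n_topics in (2, 3):      # the variable being played with
    for attempt in (1, 2):   # several tries per setting, each with its own seed
        doc_topic, topic_word = lda_gibbs(docs, n_topics, seed=attempt)
        results[(n_topics, attempt)] = (doc_topic, topic_word)
        print(f"Topics:{n_topics} Try:{attempt}", top_words(topic_word))
```

In a sampler like this, each try starts from a different random assignment of words to topics, which is probably one reason two tries with the same number of topics need not agree.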

 

TABLE 1 -- Topics:10 Try:1 Ordered from most to least prevalent

 

1 fuck back good protocol fish citizens gonna great disgusting days
2 money road happened time gotta didn dangerous live keys proper
3 shit uprising clean faster thing doesn infertility flag cows london
4 baby car girl immigrants stop day death public barn cops
5 papers years illegal diego supposed idea matter life coast telling
6 fucking make choice cousin wanted stork house bring place hour
7 theo gonna kee government care transit sake amigo work thinking
8 julian ll coming moving leader wheel bleeding wait ready activist
9 don safe human project stay police people world doctor fugee
10 tomorrow watching talk ago person patric shirt aid train bus

 

TABLE 2 -- Topics:10 Try:2 Ordered from most to least prevalent

 

1 back kee citizens coast bring patric grand britain immigrant man
2 theo make police protocol leader faster gotta god fugee coffee
3 fucking world money supposed pregnant barn clean shot great thing
4 years safe illegal shit transit idea didn telling house leave
5 project death uprising tomasz cops wanted dangerous infertility person keys
6 government ll care time find fish stork tits shirt aid
7 don girl good people choice blood moving rest real push
8 car gonna papers make matter happened day coming doesn watching
9 fuck julian human immigrants stay diego road days life public
10 baby proper doctor place tomorrow wait ringing amigo months killed

 

TABLE 3 -- Topics:5 Try:1 Ordered from most to least prevalent

 

1 fucking make safe immigrants police human diego road idea proper
2 theo years world project coming ll talk matter happened amigo
3 baby julian illegal stay money supposed time barn fish clean
4 girl fuck back papers care good shit sake day life
5 don car gonna kee government stop people cousin uprising tomorrow

 

TABLE 4 -- Topics:5 Try:3 Ordered from most to least prevalent

 

1 fucking make government ll care coming people road talk sake
2 theo girl years kee papers good day clean leader tomasz
3 fuck safe back world project immigrants illegal police transit cousin
4 don julian stop money diego supposed idea amigo life proper
5 baby car gonna human stay happened time days doctor public

 

*I established prevalence order by the number of documents in which a given topic had the highest probability (i.e., topic 1 had the highest probability in the largest number of documents).*
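
The ordering rule in the note above can be sketched as follows. The function name `prevalence_order` and the probabilities are hypothetical; the only thing taken from the note is the rule itself: count, for each topic, the documents in which it is the most probable topic, then sort by that count.

```python
from collections import Counter


def prevalence_order(doc_topic_probs):
    """Order topic indices by how many documents they dominate (most first)."""
    n_topics = len(doc_topic_probs[0])
    # For each document, find the index of its highest-probability topic.
    wins = Counter(
        max(range(n_topics), key=row.__getitem__) for row in doc_topic_probs
    )
    # sorted() is stable, so ties keep their original topic-index order.
    return sorted(range(n_topics), key=lambda k: -wins[k])


# Hypothetical document-topic probabilities: 5 documents, 3 topics.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
]

order = prevalence_order(probs)
print(order)
```

Note that this measure ignores how strongly a topic wins a document, so a topic that narrowly leads in many documents outranks one that dominates a few.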

 

I'm trying to make sense of what kinds of variation occur as the number of topics and the number of runs change. For now let me focus on the topic word "immigrants", which names a main theme of the film and as such, along with the main characters' names (Theo, Julian, and Kee), should appear in every run. Two peculiarities stand out:

 

  1. Words are grouped into topics by how often they appear within the same document, so it's interesting how much the words associated with "immigrants" vary across just these four runs. The one exception seems to be that in most of these cases (and in others not shown here) "immigrants" shows up in the same topic as "fuck" or "fucking".
  2. Given that topic modeling depends on the probability of finding a particular set of topic words within a corpus, it seems a bit strange that in Table 3 the topic that includes "immigrants", "diego", and "road" is the most prevalent, while in Table 2 the topic that includes these same words is near the bottom (ninth of ten).
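
One way to put a number on the first observation is to take the topic containing "immigrants" from each of the four tables and measure how much the other words overlap between runs. The sketch below uses Jaccard similarity (shared words divided by all words across the pair), which is my choice of measure for illustration rather than anything the topic modeling itself reports; the topic lines are copied from Tables 1 to 4.

```python
from itertools import combinations

# The topic containing "immigrants" in each table: (rank, number of topics, words).
immigrant_topics = {
    "Table 1": (4, 10, "baby car girl immigrants stop day death public barn cops"),
    "Table 2": (9, 10, "fuck julian human immigrants stay diego road days life public"),
    "Table 3": (1, 5, "fucking make safe immigrants police human diego road idea proper"),
    "Table 4": (3, 5, "fuck safe back world project immigrants illegal police transit cousin"),
}

# Words that share a topic with "immigrants" in each run.
neighbors = {
    name: set(words.split()) - {"immigrants"}
    for name, (_, _, words) in immigrant_topics.items()
}

shared = {}
jaccard = {}
for a, b in combinations(neighbors, 2):
    shared[(a, b)] = neighbors[a] & neighbors[b]
    jaccard[(a, b)] = len(shared[(a, b)]) / len(neighbors[a] | neighbors[b])
    print(a, "vs", b, sorted(shared[(a, b)]), round(jaccard[(a, b)], 3))

# Where the "immigrants" topic sits in each run's prevalence order.
for name, (rank, n_topics, _) in immigrant_topics.items():
    print(f"{name}: topic {rank} of {n_topics}")
```

By this measure the closest pair is Tables 2 and 3, which share "human", "diego", and "road", while Table 1 shares nothing with Tables 3 and 4 beyond "immigrants" itself.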

 
