[gov2001-l] Preliminary Abstract

24 Apr 2006

In recent years, political methodologists have produced innumerable automated
document classification systems. Many of these ignore word sequence
information, treating entire documents as mere collections of words. A subset
of these, including those based on the well-known Naive Bayes algorithm, assumes
that word frequencies are independent. In this paper, we investigate the success
of such algorithms by comparing Wordscores, a Naive Bayes derivative, to several
well-known algorithms and to a new classification system based on vanilla
recursive hierarchical Dirichlet-multinomial mixture models, pointing out
avenues for future advancement. Surprisingly, we find that, its assumptions
notwithstanding, Wordscores shows dramatically increased performance, comparable
to some of the latest developments in document classification, at carrying out
a small number of carefully selected classifications on meticulously arranged
collections of political documents. We also discuss its use in practical
applications.
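
For readers unfamiliar with these terms, the bag-of-words and independence
assumptions can be illustrated with a minimal Python sketch of a multinomial
Naive Bayes classifier. This is a generic textbook version, not Wordscores and
not the system evaluated in the paper; the function names, the add-one
smoothing, and the toy training documents are all assumptions made for
illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Reduce a document to word counts, discarding word order."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Collect per-class word counts and per-class document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        word_counts.setdefault(label, Counter()).update(bag_of_words(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the class with the highest log posterior score."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        for word, n in bag_of_words(text).items():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # Independence assumption: each word contributes its own term,
            # regardless of which other words appear or where it sits.
            p = (counts[word] + 1) / (n_words + len(vocab))  # add-one smoothing
            score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data, for illustration only.
toy_docs = [
    ("cut taxes reduce spending", "right"),
    ("raise taxes expand welfare", "left"),
]
word_counts, doc_counts = train(toy_docs)
prediction = classify("expand welfare spending welfare", word_counts, doc_counts)
```

Because the classifier sees only counts, "cut taxes" and "taxes cut" are
indistinguishable to it, which is the limitation the abstract refers to.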

Quoting Gary King <king(a)harvard.edu>:

...

 this sounds good.  i would go a little bit farther in explaining some of
 the terms to political scientists who don't know what naive bayes anything
 is.
 Gary

 On Mon, 24 Apr 2006, ghumphr(a)fas.harvard.edu wrote:

 Geoff Humphreys and Chris Long

 Classifying Political Documents

 In recent years, political methodologists have produced innumerable
 automated document classification systems. Many of these systems, such as
 those based on the well-known Naive Bayes algorithm, treat each word as a
 distinct entity, ignoring complex interactions between them. While for some
 applications this approach may appear reasonable, the precise arrangements
 of words in political documents often convey meanings which cannot be
 captured so easily. In this paper, we investigate the success of such naive
 algorithms by comparing Wordscores, a Naive Bayes derivative, to several
 well-known algorithms and to a new classification system based on vanilla
 recursive hierarchical Dirichlet-multinomial mixture models, pointing out
 avenues for future advancement. Surprisingly, we find that, the assumptions
 of Wordscores notwithstanding, it shows dramatically increased performance,
 comparable to some of the latest developments in document classification,
 at carrying out a small number of carefully selected classifications on
 meticulously arranged collections of political documents. We also discuss
 its use in practical applications.

 _______________________________________________
 gov2001-l mailing list
 gov2001-l(a)lists.fas.harvard.edu
 http://lists.fas.harvard.edu/mailman/listinfo/gov2001-l
