Actually, nm, I need to make some more changes.
I hope that all is well with you.
Geoff
Quoting ghumphr(a)fas.harvard.edu:
In recent years, political methodologists have produced innumerable
automated document classification systems. Many of these ignore word
sequence information, treating entire documents as mere collections of
words. A subset of these, including those based on the well-known Naive
Bayes algorithm, assumes that word frequencies are independent. In this
paper, we investigate the success of such algorithms by comparing
Wordscores, a Naive Bayes derivative, to several well-known algorithms and
a new classification system based on vanilla recursive hierarchical
Dirichlet-multinomial mixture models, pointing out avenues for future
advancement. Surprisingly, we find that, its assumptions notwithstanding,
Wordscores shows dramatically increased performance, comparable to some of
the latest developments in document classification, at carrying out a
small number of carefully selected classifications on meticulously
arranged collections of political documents. We also discuss its use in
practical applications.
Quoting Gary King <king(a)harvard.edu>:
this sounds good. i would go a little bit farther in explaining some of
the terms to political scientists who don't know what naive bayes anything
is.
Gary
On Mon, 24 Apr 2006, ghumphr(a)fas.harvard.edu wrote:
Geoff Humphreys and Chris Long
Classifying Political Documents
In recent years, political methodologists have produced innumerable
automated document classification systems. Many of these systems, such as
those based on the well-known Naive Bayes algorithm, treat each word as a
distinct entity, ignoring complex interactions between them. While for
some applications this approach may appear reasonable, the precise
arrangements of words in political documents often convey meanings which
cannot be captured so easily. In this paper, we investigate the success of
such naive algorithms by comparing Wordscores, a Naive Bayes derivative,
to several well-known algorithms and a new classification system based on
vanilla recursive hierarchical Dirichlet-multinomial mixture models,
pointing out avenues for future advancement. Surprisingly, we find that,
the assumptions of Wordscores notwithstanding, it shows dramatically
increased performance, comparable to some of the latest developments in
document classification, at carrying out a small number of carefully
selected classifications on meticulously arranged collections of political
documents. We also discuss its use in practical applications.
_______________________________________________
gov2001-l mailing list
gov2001-l(a)lists.fas.harvard.edu
http://lists.fas.harvard.edu/mailman/listinfo/gov2001-l