13th International Conference
on Functional Grammar
Applying Machine Learning and Functional
Grammar to Web Page Classification
Simon Courtenage,
University of Westminster, London, Great
Britain
Understanding the (textual) content of a web page and deciding which topic
or subject it applies to is easy for humans. It is much more difficult for machines.
Without the benefit of language comprehension, machines
(or rather the software they run) can only hope to recognise
patterns of word collections and relate them to patterns they
already know about. The problem of deciding what topic a piece of natural
language text is about reduces then to deciding how close a
pattern of words is to a pattern or set of patterns that the
machine knows are related to a particular topic.
In this paper we discuss the problem of text classification using
machine learning as it relates to the particular problem of
web page classification, where the aim is, given a number of
predefined categories or topics, to decide which category a
web page belongs to on the basis of its (textual) content.
This is an ongoing research area within the web search
and machine learning communities, and is useful for
search engines that want to improve their search results and/or
advertising.
Although many solutions have been proposed in connection with automated
web page classification [Qi & Davidson, 2007] (and the
more general area of text classification), these have mostly
depended on identifying vocabularies that define a category
and then attempting to fit the words used in a web page with
the vocabulary of a category. For example, we may decide that the words “Java”,
“software”, “program”, “bean”, “enterprise”
and “development” form a vocabulary that defines the topic
of writing software in the Java programming language, while
the words “Java”, “bean”, “coffee”,
“cup”, “sugar” and “ground” form a vocabulary that
defines the topic of Java coffee. In order to decide which category a web page fits into,
we then look at how closely the words in the page fit each
vocabulary. The
closer the fit, the more confident we are that the
page belongs to that category. Very little research, however, has considered the way
in which words are used in the page in order to improve the
classification results.
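The vocabulary-fitting idea described above can be illustrated with a minimal sketch. It is not the method of any cited work: the category names, the `fit_scores` and `classify` helpers, and the choice of "fraction of vocabulary words present in the page" as the measure of fit are all assumptions made for illustration. Only the two word lists come from the example in the text.

```python
import re

# Vocabularies from the example in the text; the category names are hypothetical.
VOCABULARIES = {
    "java_programming": {"java", "software", "program", "bean",
                         "enterprise", "development"},
    "java_coffee": {"java", "bean", "coffee", "cup", "sugar", "ground"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_scores(text, vocabularies=VOCABULARIES):
    """Return, per category, the fraction of its vocabulary found in the text.

    This overlap ratio is an assumed, deliberately simple measure of fit.
    """
    words = set(tokenize(text))
    return {name: len(words & vocab) / len(vocab)
            for name, vocab in vocabularies.items()}


def classify(text, vocabularies=VOCABULARIES):
    """Pick the category whose vocabulary fits the text most closely."""
    scores = fit_scores(text, vocabularies)
    return max(scores, key=scores.get)


coffee_page = "Grind the Java bean, add sugar, and pour the coffee into a cup."
code_page = "Enterprise Java software development with a bean container."
print(classify(coffee_page), fit_scores(coffee_page))
print(classify(code_page), fit_scores(code_page))
```

Both sample pages contain “Java” and “bean”, so the decision rests on the remaining vocabulary words. The sketch looks only at which words are present, not at how they are used.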
We demonstrate how Functional Grammar (FG) can be applied to the problem
of improving classification results of such automated methods.
We concentrate on supervised machine learning methods
that depend explicitly on word usage patterns, for example,
Support Vector Machines (SVMs) [Joachims, 2002], which specify
particular vocabularies for use in detecting word usage
patterns. These
methods are supervised in that they are first trained
to recognise particular word patterns, using a vocabulary and
example sets of web pages. Conventional use of SVMs considers only word occurrence, rather than word
placement or use. We
show how information derived from an FG representation of text
can be used to improve the training and use of SVMs and hence
improve their classification performance.
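To see why occurrence-only features discard word placement, consider the following small sketch of a binary word-occurrence vector, the kind of input a classifier such as an SVM might be given. The vocabulary, the `occurrence_vector` helper, and the two sample sentences are hypothetical and chosen only for illustration. No SVM is trained here, and no FG-derived features are shown.

```python
import re

# Hypothetical vocabulary; each position in the vector corresponds to one term.
VOCABULARY = ["java", "bean", "coffee", "cup", "program", "software"]


def occurrence_vector(text, vocabulary=VOCABULARY):
    """Map text to a 0/1 vector: 1 if the vocabulary term occurs, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in words else 0 for term in vocabulary]


# Two sentences with the same words in different roles and positions.
page_a = occurrence_vector("The program orders coffee")
page_b = occurrence_vector("Coffee orders the program")

print(page_a)
print(page_b)
print(page_a == page_b)
```

The two sentences give the same vector even though the words play different roles in each. An occurrence-only representation cannot capture that difference.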
References:
- Qi, Xiaoguang and Brian Davidson. 2007. Web Page Classification: Features and Algorithms. Technical Report LU-CSE-07-010, Lehigh University.
- Joachims, Thorsten. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer.