Functional Grammar

13th ICFG 2008

Abstracts

13th International Conference on Functional Grammar

Applying Machine Learning and Functional Grammar to Web Page Classification

Simon Courtenage,
University of Westminster, London, Great Britain

Understanding the (textual) content of a web page and deciding which topic or subject it applies to is easy for humans. It is much more difficult for machines. Without the benefit of language comprehension, machines (or rather the software they run) can only hope to recognise patterns of word collections and relate them to patterns they already know about. The problem of deciding what topic a piece of natural language text is about then reduces to deciding how close a pattern of words is to a pattern or set of patterns that the machine knows are related to a particular topic.

In this paper we discuss the problem of text classification using machine learning as it relates to the particular problem of web page classification, where the aim is, given a number of predefined categories or topics, to decide which category a web page belongs to on the basis of its (textual) content. This is an on-going research area within the web search and machine learning research communities, and is useful for search engines that want to improve their search results and/or advertising.

Although many solutions have been proposed in connection with automated web page classification [Qi & Davidson, 2007] (and the more general area of text classification), these have mostly depended on identifying vocabularies that define a category and then attempting to fit the words used in a web page to the vocabulary of a category. For example, we may decide that the words “Java”, “software”, “program”, “bean”, “enterprise” and “development” form a vocabulary that defines the topic of writing software in the Java programming language, while the words “Java”, “bean”, “coffee”, “cup”, “sugar” and “ground” form a vocabulary that defines the topic of Java coffee. In order to decide which category a web page fits into, we then look at how closely the words in the page fit each vocabulary. The closer the fit, the more confident we are that the page belongs to that category. Very little research, however, has considered the way in which words are used in the page in order to improve the classification results.
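
The vocabulary-fitting idea described above can be illustrated with a minimal Python sketch. The two vocabularies are the ones given in the example; the scoring rule (the fraction of a vocabulary's words that occur in the page) and the names `tokenize`, `fit_score` and `classify` are assumptions made for illustration only, not a method taken from the cited work.

```python
import re

# Vocabularies taken from the example in the text above.
VOCABULARIES = {
    "java_programming": {"java", "software", "program", "bean", "enterprise", "development"},
    "java_coffee": {"java", "bean", "coffee", "cup", "sugar", "ground"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_score(text, vocabulary):
    """Assumed scoring rule: fraction of the vocabulary's words found in the text."""
    words = set(tokenize(text))
    return len(words & vocabulary) / len(vocabulary)


def classify(text, vocabularies=VOCABULARIES):
    """Return the best-fitting category together with all per-category scores."""
    scores = {name: fit_score(text, vocab) for name, vocab in vocabularies.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical page text, invented for this illustration.
page = "Grind the Java bean, brew the coffee, and add sugar to your cup."
category, scores = classify(page)
print(category, scores)
```

Note that “Java” and “bean” count towards both vocabularies, so the decision rests entirely on the remaining words. A scheme of this kind has no access to how the shared words are being used in the page.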

We demonstrate how Functional Grammar (FG) can be applied to the problem of improving the classification results of such automated methods. We concentrate on supervised machine learning methods that depend explicitly on word usage patterns, for example, Support Vector Machines (SVMs) [Joachims, 2002], which specify particular vocabularies for use in detecting word usage patterns. These methods are supervised in that they are first trained to recognise particular word patterns, using a vocabulary and example sets of web pages. Conventional use of SVMs considers only the occurrence of words, rather than their placement or use. We show how information derived from an FG representation of text can be used to improve the training and use of SVMs and hence improve their classification performance.
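
To make the limitation of occurrence-only features concrete, the following Python sketch builds a simple word-count vector of the kind that is commonly given to a classifier such as an SVM. It is not an SVM and not the approach of the paper; the function name `bag_of_words`, the four-word vocabulary and the two sentences are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map text to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical ordered vocabulary; each position is one feature.
vocabulary = ["java", "bean", "coffee", "program"]

# Two invented sentences that use the same vocabulary words in different roles.
a = bag_of_words("The program orders coffee made from the Java bean.", vocabulary)
b = bag_of_words("The Java bean program ships with a coffee icon.", vocabulary)

# Both sentences produce the same counts, so word placement and use are lost.
print(a, b, a == b)
```

Because the two vectors are identical, a classifier trained on such features cannot tell the sentences apart. Any information about the role a word plays has to be supplied as additional features.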


References:

Qi, Xiaoguang and Brian Davidson. 2007. Web Page Classification: Features and Algorithms, Technical Report LU-CSE-07-010, Lehigh University.
Joachims, Thorsten. 2002. Learning to Classify Text Using Support Vector Machines, Kluwer.