Date of Original Version

6-2011

Type

Conference Proceeding

Journal Title

Proceedings of the ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LATECH)

First Page

115

Last Page

123

Rights Management

Copyright 2011 ACL

Abstract or Description

While the study of the connection between discourse patterns and personal identification is decades old, the study of these patterns using language technologies is relatively recent. In that more recent tradition, we frame author age prediction from text as a regression problem. We explore the same task using three very different genres of data simultaneously: blogs, telephone conversations, and online forum posts. We employ a technique from domain adaptation that allows us to train a joint model on all three corpora together as well as separately, and we analyze differences in predictive features across the joint and corpus-specific aspects of the model. Effective features include both stylistic ones (such as POS patterns) and content-oriented ones. Using a linear regression model based on shallow text features, we obtain correlations up to 0.74 and mean absolute errors between 4.1 and 6.8 years.
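
The abstract does not name the domain adaptation technique it uses. As a rough illustration only, the sketch below assumes a feature-augmentation scheme in which every feature vector is copied into one shared block and one corpus-specific block, so that a single linear model learns joint and per-corpus weights. The function names, the ridge penalty, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def augment(X, domain_ids, n_domains):
    """Copy each row into a shared block plus the block of its own domain.

    Assumed scheme (not confirmed by the abstract): output width is
    d * (n_domains + 1); block 0 is shared, block k + 1 belongs to domain k.
    """
    n, d = X.shape
    out = np.zeros((n, d * (n_domains + 1)))
    out[:, :d] = X  # shared ("joint") copy of every example
    for i, k in enumerate(domain_ids):
        out[i, d * (k + 1): d * (k + 2)] = X[i]  # corpus-specific copy
    return out

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression; the L2 penalty is an assumption."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def demo(seed=0):
    """Fit on synthetic data standing in for three corpora."""
    rng = np.random.default_rng(seed)
    n, d, n_domains = 300, 5, 3
    X = rng.normal(size=(n, d))
    domains = rng.integers(0, n_domains, size=n)
    # Synthetic "ages": a shared effect plus a small per-domain effect.
    shared_w = rng.normal(size=d)
    domain_w = rng.normal(scale=0.5, size=(n_domains, d))
    y = 30 + X @ shared_w + np.einsum("ij,ij->i", X, domain_w[domains])
    Xa = augment(X, domains, n_domains)
    Xa = np.hstack([Xa, np.ones((n, 1))])  # bias column
    w = fit_ridge(Xa, y, lam=0.1)
    pred = Xa @ w
    return pearson_r(y, pred), mean_absolute_error(y, pred)

if __name__ == "__main__":
    r, mae = demo()
    print(f"correlation={r:.2f}  MAE={mae:.2f}")
```

In this sketch, the weights in the first block play the role of the joint aspect of the model, and the weights in the remaining blocks play the role of the corpus-specific aspects, which is one way the comparison of predictive features described above could be set up.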

Creative Commons License

Creative Commons Attribution-Noncommercial-Share Alike 3.0 License

Published In

Proceedings of the ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LATECH), 115-123.