Date of Award

8-25-2011

Embargo Period

1-17-2013

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Language Technologies Institute

Advisor(s)

Jaime Carbonell

Abstract

Social media collections are becoming increasingly important in the everyday life of Internet users. Recent statistics show that sites hosting social media and community-generated content account for five of the top ten most visited websites in the United States [4], are visited regularly by a broad cross-section of Internet users [61, 67, 115] and host an enormous quantity of information [119, 48, 9]. The increasing importance and size of these collections require that information retrieval systems pay special attention to them, and in particular to those aspects of social media collections that set them apart from the general web.

Social media collections are interesting and challenging from the perspective of information retrieval systems. These collections are dynamic, with content being constantly added, removed and modified. They are time-sensitive, with the most recently added content often viewed as the most significant. They are richly structured, with authorship information, often a threading structure, and higher-level topical classifications. Although this type of collection structure is frequently critical for comprehension, it is rarely exploited in retrieval algorithms.

This thesis investigates the hypothesis that we can improve retrieval performance in these collections by leveraging this type of structure. To evaluate this hypothesis, we present an exploration of search in several social media collections: blogs and online forums. We demonstrate the utility of leveraging collection structure in three different retrieval tasks: blog post search, blog feed search, and forum thread search. The techniques explored throughout these experiments include evaluating the representation granularity of collections of documents and incorporating content an author has written throughout the collection. Our results show that, although the retrieval tasks and the techniques for leveraging this type of collection structure are varied, in many cases substantial and significant improvements in retrieval quality can be realized.

Comments

CMU-LTI-11-014
