Date of Original Version

Winter 12-22-2017



Abstract or Description

CMU-ISR-17-118 / CMU-LTI-17-010

Privacy policies notify Internet users about the privacy practices of websites, mobile apps, and other products and services. However, users rarely read them and struggle to understand their contents. Moreover, the entities that provide these policies sometimes have little incentive to make them comprehensible. Recently, annotated corpora of privacy policies have been introduced to the research community. They open the door to machine learning and natural language processing techniques that automate the annotation of these documents. In turn, these annotations can be passed on to interfaces (e.g., web browser plugins) that help users quickly identify and understand relevant privacy statements. We present advances in extracting privacy policy paragraphs (termed segments in this paper) and individual sentences that relate to expert-identified categories of policy contents, using supervised learning methods. In particular, we show that relevant segments and sentences can be classified with average micro-F1 scores of 0.79 and 0.70, respectively, improving over prior work. We discuss how the techniques introduced in this paper have been used to automatically annotate the text of about 7,000 privacy policies. Our discussion highlights opportunities as well as limitations associated with our classification approach.
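For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a micro-averaged F1 score can be computed when each segment may carry several category labels. It is an illustration only, not the evaluation code used in the report: the function name `micro_f1`, the placeholder labels `"A"`, `"B"`, `"C"`, and the toy data are all assumptions introduced here.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over multi-label items.

    gold, predicted: equal-length lists of sets of labels,
    one set per text segment (hypothetical data layout).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Pool true positives, false positives, and false negatives
    # across all segments and all categories before averaging.
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels both assigned and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed

    if tp == 0:
        return 0.0  # convention chosen here to avoid division by zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder category labels (not from the report).
gold = [{"A"}, {"B", "C"}, {"A"}]
predicted = [{"A"}, {"B"}, {"A", "C"}]
score = micro_f1(gold, predicted)
```

Because the counts are pooled before precision and recall are computed, frequent categories weigh more heavily in a micro-averaged score than rare ones.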