| Your Name | Project Title | Project Idea or Outline |
| | Semantic Ontology Learning | Learn/extract an ontology from Wikipedia (see Ponzetto & Strube, 2007) and maybe use it to augment WordNet (see Suchanek, Kasneci & Weikum, 2007) |
| | Large Text Corpus | I have a collection of 115,000 full-length books from the Internet Archive. For each, I also have the call number, which identifies the primary subject of the book. A number of languages are present, the most common being English and then German. In the next several entries, I will describe several projects using this dataset that would be helpful for my research. I would be delighted to make the corpus available to other groups with interesting uses for it, so brainstorm away! |
| | Language Normalization | Many books in the corpus are very old: the old typefaces result in more noise from the OCR process, and the spelling and morphology differ from modern usage in a time-dependent way. Can we identify the age of each book and preprocess it so that it resembles modern English and can be parsed with modern parsers? Alternatively, can we use the data to train a parser that can handle a wider range of English? |
| | Book segmentation | Given a book from the above collection, identify its structural elements, e.g. where the main text starts and ends, and chapter breaks. Also identify page breaks (page numbers and running headings at the top of the page, footnotes). It would also be very useful to identify the title, chapter titles, index, and table of contents. |
| | Topic identification | Given a passage from a book, identify the key topics. |
| | Search Query Understanding | Context: web search, ranking (ordering) retrieved documents so that most useful ones are first. Naive NLP techniques are not useful, but we have evidence that natural language characteristics of the query are nevertheless essential to rank the results correctly. Useful approaches may be to identify/classify the main topic of the query and relationships among query terms. |
| Karan Neha Deodhar Vighnesh Harikrishna Samantha, Gagan | Lyric-based Music Recommendation | Context: a system that recommends songs to a user based on lyrics. The recommendation should be contextually correct (someone on a long drive looking for an appropriate "road" song should not get a song about the singer's long "road" to hell), and should also match the sentiment (a user in an upbeat, celebratory mood who searches for a "party" song should not get a song about how the depressed singer's life "ain't no party"). Example applications of this system include personal song dedications, commercial TV programmers looking for background music relevant to their content, and more. |
| <>, Gagan, Anushree | Extension of Karan's Idea | Karan's idea sounds very interesting! My idea: since an artist/band's songs tend to be consistent with their overall theme(s), we could extend the recommendation mechanism beyond lyrics to also consider information about artists/bands. We could possibly draw this information from sources like Wikipedia or CDBaby.com. The Wikipedia article on Metallica, for example, describes how their music has a fast tempo and is considered "thrash metal," which might suggest additional things about their music. This information might be helpful in determining whether a particular song is appropriate for a given context. |
| Chien-Ming Steven Bryan Alex Travis | Article summarization | Given an article, summarize it in one or two sentences. The idea came from the fact that it sometimes takes a long time to read all the posts on friends' blogs every day. If all the posts were summarized, we could read the summaries first and open the full post only if it looks interesting. I am thinking the system could be web-based and combined with RSS, so that each time we receive only summaries of subscribed sources. This could also be applied to news, email, and so on. One other extension is that the system might set priorities based on the summaries, so we can read (or, for email, handle) items in the right order. |
| Aniket | QA system | The ultimate goal is an automatic "Yahoo! Answers": ask a question and get related answers found by googling (or yahooing, whatever). However, this might be a bit too ambitious for a term project. We can narrow the scope by focusing on a specific domain. For example, given a large collection of papers in bioinformatics or health, ask a question such as which gene (or some other fact) is related to a specific symptom, and find the answers in the papers. You can find more information at BioCreative, and the TREC QA track may give you more ideas. |
| Madhav, Sneha, Abhinav, Aniket | Blog classification | A system that can classify blogs based on their content. I have not yet defined the categories well (suggestions welcome!), but the idea is a system that automatically tags blogs by identifying the meaning of their content, so that readers can choose the material they want to read. |
| Anupreet | Automated Blog Tag Recommendation and Clustering | Something similar to Madhav's idea, but I am planning to use collaboration and ensemble learning to do the same. |
| | Abbreviating Emails | Using linguistic analysis to abbreviate an email message so that it can be displayed on a cell phone |
| Nihar, Ajay, Nitin, Purna | Sentiment Analysis on Customer Survey data | Use various natural language processing techniques to classify customer surveys. An extension would be to extract the specific polarity-oriented (+/-) opinions customers have about specific products or services they acquired. An even more challenging aspect of the project would be to make the approach independent of the domain. This might be an ambitious project, depending on the depth of the algorithm and the features one chooses to implement, but the scope could be reduced. |
| Aniket | Stock Market Prediction | Use NLP to detect and extract financial data published across newspapers and magazines. With this data, run statistical models to predict the financial markets for the next day/week |
| Madhav, Abhinav, Sneha | Story Understanding | Use NLP-based techniques for story understanding. Ideas include: 1. Answering questions about characters, their relations, their roles, etc. 2. Completing a story (ambitious?). 3. Creating a system that can play a game of completing a story with inputs from a human. 4. Clustering stories based on likeness of characters, underlying themes, morals, etc. |
| Mugdha, Nitisha, Aninda | Movie Recommendation System | Given a movie or a genre or a theme, recommend other movies |
| Alejandro Dominguez | Conversational NLP | I have extracted a fair amount of information from a health-based forum, storing threads, replies, users, subforums, and so on. The data includes 64,000+ posts, 187,000+ replies, and 36,000+ users. Given this information, a number of NLP algorithms could be tried; for example, the data could be used to create/expand an ontology, extract features, and so on. The challenge is that the language is highly informal and the class space is huge. |
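
As a possible starting point for the article summarization idea in the table, here is a minimal sketch of a word-frequency extractive baseline. It is not the method any group has committed to, only an illustration under stated assumptions: the function names (`split_sentences`, `summarize`), the tiny stopword list, and the regex-based sentence splitter are all hypothetical choices made for this example. It scores each sentence by the average corpus frequency of its non-stopword tokens and keeps the top one or two sentences in their original order.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "were",
    "with", "than",
}


def tokenize(text):
    """Lowercase the text and return its non-stopword word tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def summarize(text, max_sentences=2):
    """Return up to max_sentences of the highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    article = (
        "Cats sleep a lot. "
        "Cats and dogs are popular pets, and cats sleep more than dogs. "
        "The weather was nice."
    )
    print(summarize(article, max_sentences=1))
```

A baseline like this ignores meaning entirely, so it is mainly useful as a point of comparison for anything more linguistically informed.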