Overview
The query we want to return has the form
“entity_name AND (good_keyword1 OR good_keyword2 OR …) AND NOT bad_keyword1 AND NOT bad_keyword2 AND NOT …”
This query means we want each returned tweet to contain the entity name and at least one of the good keywords (combined with “OR” logic), and to contain none of the bad keywords (combined with “AND NOT” logic). Our algorithm finds the lists of good keywords and bad keywords to be used in the query.
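The query form above can be sketched as a small helper. This is a minimal illustration; the function name and signature are hypothetical, and the real system may represent queries differently.

```python
def build_query(entity_name, good_keywords, bad_keywords):
    """Assemble the boolean query string in the form shown above.

    Illustrative sketch: entity name, then good keywords joined with OR,
    then each bad keyword appended with AND NOT.
    """
    query = entity_name
    if good_keywords:
        query += " AND (" + " OR ".join(good_keywords) + ")"
    for bad in bad_keywords:
        query += " AND NOT " + bad
    return query

print(build_query("Birdman", ["movie", "oscar"], ["bird"]))
# Birdman AND (movie OR oscar) AND NOT bird
```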
Stage 1
The main idea is to start with a query that returns a relatively pure set of tweets and keep expanding the query until we have captured a good amount of relevant tweets. Thus, we start with one simple query and add more keywords to it iteratively. In every iteration, we run the new query to get tweets, and then classify the tweets into “good”, “ambiguous”, and “bad” categories: “good” tweets are relevant, “bad” tweets are irrelevant, and “ambiguous” tweets are those we cannot confidently label either way. We then extract keywords from the “good” tweets and the “bad” tweets to expand our good and bad keyword lists, and create a new query that adds these newly extracted keywords.
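The expansion loop above can be sketched as follows. The callables `run_query`, `classify`, and `extract` are hypothetical interfaces standing in for the tweet database, the classifier, and the keyword extractor; they are assumptions for illustration, not a real Twitter API.

```python
def expand_query(entity, seed_good, run_query, classify, extract, n_iters=3):
    """Sketch of the iterative query-expansion loop described above.

    Assumed (hypothetical) interfaces:
      run_query(query)  -> list of tweet texts
      classify(tweet)   -> "good", "ambiguous", or "bad"
      extract(tweets)   -> ranked list of keywords
    """
    good_kw, bad_kw = list(seed_good), []
    for _ in range(n_iters):
        # Build the current query in the form described in the Overview.
        query = entity + " AND (" + " OR ".join(good_kw) + ")"
        query += "".join(" AND NOT " + b for b in bad_kw)
        tweets = run_query(query)
        good = [t for t in tweets if classify(t) == "good"]
        bad = [t for t in tweets if classify(t) == "bad"]
        # Expand the keyword lists with newly extracted keywords.
        good_kw += [k for k in extract(good) if k not in good_kw]
        bad_kw += [k for k in extract(bad) if k not in bad_kw]
    return good_kw, bad_kw
```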
Obtaining Two Sets of Tweets
For each category of entities, we ask the user for input: a set of pre-defined keywords. The pre-defined keywords are the words the user thinks are most related to the true entity. We first run a query with just the entity name against the whole tweet database and name the returned set of tweets T. Then we run a query with the entity name and the pre-defined keywords to get a set of tweets S. In the set S, every tweet contains the entity name and at least one of the pre-defined keywords for the category (movies, in our running example). Therefore we assume S is a relatively pure set of tweets relevant to the movie Birdman. We can then extract keywords we consider “good” from S, and extract “bad” keywords from T-S, the complement of S in T. The relationship of these two sets of keywords is shown in the plot below.
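The split into S and T-S can be sketched with simple substring matching; the real system queries a tweet database instead, so the matching logic here is an illustrative assumption.

```python
def split_tweet_sets(tweets, entity, predefined_keywords):
    """Sketch: T is every tweet mentioning the entity; S is the subset of T
    also containing at least one pre-defined keyword; T-S is the rest of T."""
    T = [t for t in tweets if entity.lower() in t.lower()]
    S = [t for t in T if any(k.lower() in t.lower() for k in predefined_keywords)]
    T_minus_S = [t for t in T if t not in S]
    return S, T_minus_S
```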
Stage 2
Keywords Extraction
The two sets of keywords are extracted through an existing NLP algorithm, RAKE (Rapid Automatic Keyword Extraction). The algorithm first cleans each tweet by removing undesired tokens such as "https:", "'ve", "link", etc. After cleaning, the words in the tweets are ranked by their frequency. The keywords extracted from the "good" tweet set are called "good keywords", and the keywords extracted from the "bad" tweet set are called "bad keywords".
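As a rough sketch of the clean-then-rank step, the snippet below uses plain frequency counting over cleaned tokens. It is a simplified stand-in for RAKE, not the RAKE algorithm itself, and the stop-token list is an illustrative assumption.

```python
import re
from collections import Counter

# Illustrative set of undesired tokens, following the examples in the text.
STOP_TOKENS = {"https", "'ve", "link", "the", "a", "an", "and", "to", "of"}

def extract_keywords(tweets, top_n=10):
    """Clean each tweet, then rank remaining words by frequency.
    A simplified frequency-based stand-in for RAKE."""
    words = []
    for tweet in tweets:
        for w in re.findall(r"[a-z']+", tweet.lower()):
            if w not in STOP_TOKENS:
                words.append(w)
    return [w for w, _ in Counter(words).most_common(top_n)]
```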
Initial Query Construction
We keep track of both the keywords used in the query and the keywords extracted, because we do not want to include all the extracted keywords in the query at once: we cannot fully trust the keyword extraction algorithm to always return the optimal set of keywords. Our initial query is therefore generated using only part of these good and bad keywords. The number of top-ranked keywords used is a parameter of our system. Through experiments and testing, we find it is usually enough to use the top three "good keywords" and top three "bad keywords" to construct the initial query.
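Constructing the initial query from the top-ranked keywords might look like this; `n_top=3` matches the setting found by experiment above, and the function name is illustrative.

```python
def initial_query(entity, good_keywords, bad_keywords, n_top=3):
    """Build the initial query from only the top n_top good and bad keywords
    (n_top=3 by default, per the experiments described above)."""
    good = good_keywords[:n_top]
    bad = bad_keywords[:n_top]
    query = entity + " AND (" + " OR ".join(good) + ")"
    query += "".join(" AND NOT " + b for b in bad)
    return query
```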
Stage 3
Classification of Tweets
We run the initial query to get a resulting set of tweets. For each tweet, we compare it against the complete lists of good and bad keywords and count how many of each it contains, then calculate the ratio r of the number of good keywords to the number of bad keywords. We also set up two thresholds α and β (α < β). If r is less than α, we say it is a bad tweet; if r is between α and β, we say it is an ambiguous tweet; and if r is greater than β, we believe it is a good tweet.
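The three-way classification can be sketched as below. The threshold values for α and β are illustrative placeholders, not the values used by the system, and a tweet with no bad keywords is treated as having an infinite ratio.

```python
def classify_tweet(tweet, good_kw, bad_kw, alpha=0.5, beta=2.0):
    """Classify a tweet by the ratio r of good-keyword count to
    bad-keyword count. alpha and beta are illustrative thresholds."""
    words = tweet.lower().split()
    n_good = sum(1 for k in good_kw if k in words)
    n_bad = sum(1 for k in bad_kw if k in words)
    r = n_good / n_bad if n_bad else float("inf")
    if r < alpha:
        return "bad"
    if r > beta:
        return "good"
    return "ambiguous"
```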
Keywords Extraction and Query Update
After we classify all the returned tweets, we run another round of keyword extraction on the list of good tweets and the list of bad tweets. As before, we extract some number of keywords from each list but add only part of them to the next query. All extracted keywords are added to and tracked in the complete keyword sets, which are used for classifying tweets in the next round.
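The bookkeeping described above, tracking every extracted keyword while adding only a few to the query, might be sketched as follows; `n_add` is an illustrative parameter.

```python
def update_query_keywords(query_kw, tracked_kw, extracted, n_add=3):
    """Track every newly extracted keyword in the complete set (used for
    classification), but add only the top n_add new ones to the query."""
    new = [k for k in extracted if k not in tracked_kw]
    tracked_kw.extend(new)        # complete set, used to classify next round
    query_kw.extend(new[:n_add])  # only part of them enter the next query
    return query_kw, tracked_kw
```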