How to apply Mask and Reject terms to the dataset?

Applying Mask and Reject Terms

Mask and Reject Terms

Mask and Reject terms are words or phrases (phrases are enclosed in quotes), separated by spaces, that are to be excluded or removed from the text data you have created from social media sources or uploaded through bring your own data.

What are Mask terms?

Mask terms are words or phrases that we already know to be present in the text data, so we don’t want them captured in the insights produced by the context analysis. For example, if we were to do a brand analysis of Pepsi using the Twitter hashtag #pepsi, we intuitively know that many of the user conversations (tweet texts) around the Pepsi brand are likely to contain commonly known words at a high frequency, such as pepsi, #pepsi, drink, bottle and so on. When you specify these as mask terms, the words are masked (concealed) wherever they occur in the text (sentence or paragraph).

Other examples:

a) Let’s say you want to analyse the Consumer DNA of people voicing their opinions about popular pizza brands such as PizzaHut, Dominos etc. Some of the common, high-frequency terms to expect are: pizzahut pizza pizzas pizza’s dominos shop domino’s food “fast food” etc.

b) Location analysis: country, a country name such as canada, city, cities, a city name such as “new york”, people, citizen etc.
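To make the idea of masking concrete, here is a minimal Python sketch. It is not the product's implementation: the function name `mask_terms` is hypothetical, and it assumes that masking means removing the term from the text, matching whole terms only and ignoring case.

```python
import re

def mask_terms(text, terms):
    """Remove every mask term from text (assumed: whole-term, case-insensitive)."""
    if not terms:
        return text
    # Try longer terms first so "fast food" wins over "food".
    ordered = sorted(terms, key=len, reverse=True)
    # The lookarounds stop "pepsi" from matching inside a longer word,
    # while still allowing terms that begin with "#".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    masked = pattern.sub(" ", text)
    # Collapse the extra whitespace left behind by the removed terms.
    return " ".join(masked.split())

tweet = "I love my Pepsi bottle #pepsi every day"
print(mask_terms(tweet, ["pepsi", "#pepsi", "drink", "bottle"]))
```

The rest of the tweet is kept; only the known, high-frequency terms disappear, so they no longer dominate the analysis.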

How to Specify Mask terms?

From the Dataset creation page, you can specify single words or phrases within quotes (e.g. “fast food”), separated by a space. Do not use a comma or semicolon as a term separator.

What if you don’t know or forgot to specify Mask terms?
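The term format described above (bare words, quoted phrases, spaces as the only separator) can be illustrated with a short Python sketch. The helper name `parse_terms` is hypothetical and the parsing rules are an assumption based on this description, not the tool's actual parser. It accepts both curly and straight double quotes.

```python
import re

# A quoted phrase (curly or straight double quotes) or a run of non-space characters.
_TERM = re.compile(r'[“"]([^“”"]+)[”"]|(\S+)')

def parse_terms(spec):
    """Split a space-separated term string into a list of words and phrases."""
    terms = []
    for phrase, word in _TERM.findall(spec):
        term = (phrase or word).strip()
        # Keep the first occurrence of each term and drop exact repeats.
        if term and term not in terms:
            terms.append(term)
    return terms

print(parse_terms('pizzahut pizza “fast food” "new york" pizza'))
```

In this sketch a comma typed between terms would simply become part of the neighbouring term, which illustrates why spaces should be the only separator.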

You may not know in advance which terms are commonly used in a specific context, or you might have forgotten to specify them when you ran the dataset job. To help in this situation, we have a “View/Filter” clean up tool for the dataset. Here is how to use it:

  1. Once you have created the dataset, click on the Actions => View/Filter button.
  2. This will open a window showing the lists “Top Single word occurrence”, “Top Two Word Occurrence” and “Low Distribution Two Words..”
  3. The “Top Single Words” list is sorted by the frequency of the words present in the data. If you find common, known words in the list, click on the mask button to “Add to” the mask list, then click on the “Clean up” button. You can then confirm and run the clean up job. The clean up job will mask the words you have selected throughout the entire dataset.
  4. You can similarly review the words in the other two lists (two word occurrence and low distribution) and mask terms from them. For two-word phrases, you may choose to mask the right word, the left word and/or the entire phrase.
  5. You can run the “View/Filter” job multiple times on a dataset. A noisy dataset will typically need multiple iterations. However, be careful to select only the right words to filter, so as not to lose any crucial information.
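To give a feel for what the top single word and top two word lists contain, here is a small Python sketch that counts word and word-pair frequencies. The function name `top_terms` and the simple tokenisation rule are assumptions made for illustration. The actual View/Filter tool may count differently, and the low distribution list is not reproduced here.

```python
import re
from collections import Counter

def top_terms(comments, n=10):
    """Return the n most frequent single words and two-word sequences."""
    words = Counter()
    pairs = Counter()
    for comment in comments:
        # Assumed tokenisation: lowercase runs of letters, digits, "#" and "'".
        tokens = re.findall(r"[#\w']+", comment.lower())
        words.update(tokens)
        # Adjacent token pairs within the same comment.
        pairs.update(zip(tokens, tokens[1:]))
    return words.most_common(n), pairs.most_common(n)

comments = ["pepsi drink pepsi", "cold pepsi drink"]
single, double = top_terms(comments, n=2)
print(single)
print(double)
```

Words that sit at the top of such a list and are already known from the context (the brand name, for instance) are the natural candidates for masking.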

What are Reject Terms?

Reject terms are words or phrases in the dataset that are irrelevant to the analysis context. Any user commentary containing such terms is therefore rejected (removed) from the dataset. For example, conversations around a popular soap brand on its official Facebook page may contain commentaries about “happy mothers day”, or marketing campaigns running contests or offering promotional vouchers. When you specify reject terms, any sentence or commentary containing such words or phrases is removed from the dataset.

For example, let’s say there is a marketing comment “Participate in the contest before 25,Dec to win amazing prizes!”. Specifying participate, contest or prizes as a reject term will reject (remove) this entire commentary from the dataset.
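The difference from masking is that the whole commentary is dropped, not just the matching word. A minimal Python sketch of this behaviour follows. The function name `reject_comments` is hypothetical, and whole-term, case-insensitive matching is an assumption, not a documented rule of the tool.

```python
import re

def reject_comments(comments, reject_terms):
    """Keep only the comments that contain none of the reject terms."""
    if not reject_terms:
        return list(comments)
    # Assumed matching rule: whole terms, case-insensitive.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in reject_terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [c for c in comments if not pattern.search(c)]

comments = [
    "Participate in the contest before 25,Dec to win amazing prizes!",
    "This soap smells great",
]
print(reject_comments(comments, ["contest"]))
```

Any one of the terms participate, contest or prizes is enough to remove the first comment, while the second comment is kept unchanged.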

How to Specify Reject terms?

From the Dataset creation page, you can specify single words or phrases within quotes (e.g. “latest contest”), separated by a space. Do not use a comma or semicolon as a term separator.

Sample reject terms list

Please find below a list of reject words and phrases (space separated) that you may review, copy and apply to your scenario.

“why they” “You’re so” “you’re making” “making your” “your face” “are you” visit visiting visited “watch your” tweet “earth day” “the public” “show this” “what keeps” “sharing your” “share your” “you’re really” “white house” “wonder why” “why am” “how does” “the comment” “my business” “you make” “they said” “can’t you” “love you” dating “my date” “your friend” companies “pay attention” interview interviews “your brought” news “our site” “make plans” “save” “you save” “you buy” “signed up” “sign up” purchase “your day” “pleased to” announce “this weekend” announced awards award ceremony “live stream” streaming “live show” streamed expo “long awaited” “staff pick” “editor pick” sponsor sponsors “ready to” “see what” “featuring in” “start off” “off your” “coming soon” starts “we are” participant “review the” reviews awaited “what’s your” magazine “new article” “make your” download YouTube “get your” “have you” celebrate “stay tuned” “for our” “supported us” “for supporting” “supporting us” launches review launched “support from” “home show” “my video” “available in” “available at” “new post” news “my phone” camera “you can” “you can’t” commercial school schools college university “free pack” donated donate notifications graduation “sign up” recipe “keep your” “watch your” “watch this” “you have” “come your” “do you” “picture was” “photo was” “photo is” “picture is” comments “this video” “this photo” “this picture” god “who needs” “contact us” contact movie “do you” “did you” “happened ?” shipped “tagged you” tagging posting “are you” www “www.” advt “for those” “click here” “my story” video “of you” contest voucher coupon coupons advertisement comic email upload share “check out” “prime time” parenting podcast #podcast massage podcast exhale massage chat “email to” recipe chef charity sponsor sponsored “check out” check posted blog website “review this” “support us” career jobs placement vacancy “new opening” downloaded trump election elections hillary obama politics
#deals “free for” “free with” “why should” “why are” jobs outsourcing reviews comics magazine news “read this” article “blog post” “win” “winners” hillary obama trump donald clinton deals offers offer free “watch out” “tag me” “tag you” tagged

What if you don’t know or forgot to specify Reject terms?

To identify and remove unwanted text entries from a dataset, especially one created from social media or web review site sources, we have a “View/Filter” clean up tool for the dataset. Here is how to use it:

  1. Once you have created the dataset, click on the Actions => View/Filter button.
  2. This will open a window showing the lists “Top Single word occurrence”, “Top Two Word Occurrence” and “Low Distribution Two Words..”
  3. The “Top Single Words” list is sorted by the frequency of the words present in the data. Review this list. If you find irrelevant or unsuitable words or phrases, including any words indicative of spam, select the associated “Reject” button to “Add to” the reject list, then click on the “Clean up” button. You can then confirm and run the clean up job. The clean up job will remove every text block (commentary, sentence) that contains the selected terms from the dataset.
  4. You can similarly review the words in the other two lists (two word occurrence and low distribution) to identify and reject terms. For two-word phrases, you can choose to reject the right word, the left word and/or the entire phrase.
  5. You can run the “View/Filter” job multiple times on a dataset. A noisy dataset will typically need multiple iterations. However, be careful to select only the right words to filter, so as not to lose any crucial information.