Problem Statement: Working as a Data Scientist for the Marketing Department at Reddit



I need to find the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddits, so that we can use them to determine which advertisements should populate each page. Because this is a classification problem, I will use Logistic Regression and Bayes models. Misclassification in this case is fairly harmless, so I will use the accuracy score, with a baseline of 63.3%, to measure success. Using TF-IDF vectorization, I will extract the feature importances to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • Dropped rows with a null selftext column because those rows are useless to me.
  • Combined the title and selftext columns into one new all_text column.
  • Examined distributions of word counts for the title and selftext columns per post and compared the two subreddits.

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be right 63.3% of the time.
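As a quick illustration of how such a baseline is computed, here is a minimal sketch. The label counts are hypothetical, chosen only so that the majority class makes up 63.3% of the data; the write-up does not say which subreddit is the majority class, so the labels are placeholders.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical counts: 633 posts in one class, 367 in the other.
labels = ["subreddit_a"] * 633 + ["subreddit_b"] * 367
print(round(baseline_accuracy(labels), 3))  # 0.633
```

Any model that scores below this number is doing worse than guessing the majority class every time.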

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74.

Second attempt: tried CountVectorizer with stemming as preprocessing on the first set of scraped data. Pretty bad score with high variance. Train 99%, test 72%.

  • Tried to decrease max features and the score got even worse.
  • Tried lemmatizer preprocessing instead and the test score went up to 74%.

Just increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
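A minimal sketch of this setup with scikit-learn, assuming a stratified `train_test_split`, a `CountVectorizer` with `min_df=3` and `ngram_range=(1, 2)`, and a default `LogisticRegression`. The tiny repeated corpus, the label encoding, the split size, and `random_state` are all invented for illustration, so the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 30 "dating" posts (0), 60 "relationship" posts (1).
dating = ["first date tips please",
          "what to text after a first date",
          "nervous about a first date"] * 10
relationship = ["my boyfriend never listens",
                "my girlfriend and I keep arguing",
                "should I forgive my boyfriend"] * 20
texts = dating + relationship
y = [0] * len(dating) + [1] * len(relationship)

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=30, stratify=y, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 documents;
# ngram_range=(1, 2) counts single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
vectorizer = model.named_steps["countvectorizer"]

print("train:", train_score, "test:", test_score, "cross val:", cv_scores.mean())
print("share of class 1 in test:", sum(y_test) / len(y_test))
```

Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, so the vocabulary never sees the held-out data.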

I think Tfidf worked the best to decrease my overfitting due to variance because I customized the stop words to take away the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus only on the most commonly used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting, but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different or


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.