Dato Challenge: native ad classifier

15 Aug 2015

Reading time ~1 minute

“Truly Native?”

My code for the Dato Kaggle challenge: here is the GitHub repository for this work.

Briefing

In this challenge, we are asked to develop a system that predicts whether an article is a native advertisement. It involves lots of natural language processing tricks, like boilerplate removal, topic modeling, and embedding techniques. My final model is a blend of several bagged logistic regressions, two gradient boosted trees, and two field-aware factorization machines, which achieved 0.974 AUC on the private leaderboard (the 1st-prize winner achieved 0.998).
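
To give a rough picture of the blending step, here is a minimal sketch: a weighted average of the base models' predicted probabilities. The toy data, the specific weights, and the estimator settings are illustrative only, not the competition configuration, and the FFM component is omitted since it comes from an external library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Toy data standing in for the real article features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Base learners: bagged logistic regression and a gradient boosted tree.
# (The field-aware factorization machines are not shown here.)
models = [
    BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10).fit(X, y),
    GradientBoostingClassifier(n_estimators=100).fit(X, y),
]

# Blend by a weighted average of predicted probabilities; these weights
# are illustrative, not the ones used in the competition.
weights = np.array([0.5, 0.5])
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
blended = probs.dot(weights)
```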


Code Hierarchy

The entry scripts are go_parse.sh and go_train.sh; everything starts with these two. The former parses the data into MongoDB, and the latter does the model training and evaluation.
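
As a rough sketch of what the parsing stage might do, something like the following: walk the raw HTML dump and store one document per page in MongoDB. The pymongo calls are real, but the directory layout, collection name, and field names here are my assumptions, not taken from the repository.

```python
import os
from pymongo import MongoClient

# Hypothetical parsing step; assumes a local MongoDB server is running.
client = MongoClient("localhost", 27017)
collection = client["dato"]["raw_pages"]

raw_dir = "data/raw_html"  # assumed location of the competition files
for fname in os.listdir(raw_dir):
    with open(os.path.join(raw_dir, fname), encoding="utf-8", errors="ignore") as f:
        # One MongoDB document per raw page.
        collection.insert_one({"file": fname, "html": f.read()})
```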

There are several models to choose from, as well as feature selection, blending, and hyperparameter search experiments; you can pick the function and tune the settings in main.py.
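
For instance, a hyperparameter search experiment might look something like this sketch using scikit-learn's GridSearchCV. The model choice and parameter grid are illustrative, not the repository's actual settings; only the scoring metric (AUC, the competition metric) is taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the parsed features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Illustrative grid; the real experiments tune their own settings.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",  # the competition metric is AUC
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```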

About dataflow

Here I tried to separate the data pipeline into independent stages: for example, you could change the input data format without refactoring the learning models or the evaluation stage, and vice versa. Senior Kagglers might find this project's structure overkill for a single competition; that's because this framework tries to be a generic framework for any kind of data science product.
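
One way to picture the stage separation is with interfaces like these. This is a hypothetical sketch of the idea, not the actual classes in the repository.

```python
from sklearn.metrics import roc_auc_score

# Each stage depends only on the previous stage's output, so the input
# format can change without touching the models, and a model can be
# swapped without touching evaluation.
class Parser:
    def run(self, raw_path):
        """Read raw data and return (X, y): feature matrix and labels."""
        raise NotImplementedError

class Trainer:
    def fit_predict(self, X, y, X_test):
        """Train a model and return predicted probabilities for X_test."""
        raise NotImplementedError

class Evaluator:
    def score(self, y_true, y_prob):
        return roc_auc_score(y_true, y_prob)  # the competition metric
```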



Tags: machine learning, logistic regression, random forest, svm, blending model, kaggle, dato