Tap-News is a complex web application that lets users view the latest news from around the globe. It consists of a React+Redux front end, multiple backend services including RESTful servers and RPC servers, a MongoDB database, a news scraper built on the Python Scrapy package, and multiple cloud-hosted RabbitMQ message queues. Its main functions are to monitor a wide range of news sources, fetch the latest news with the scraper, and deduplicate repeated news using TfidfVectorizer from the sklearn module without losing efficiency. In addition, the web application classifies news by topic and recommends news according to user preference using a time decay model.
Tap news structure
Topic modeling & news training & feature engineering
I attempt to classify the news collected through the news API into categories such as crime, media, and politics, based on the news titles. This is done with machine learning. First, I collected 500 news titles with labeled topics through crowdsourcing and used them as training data. Training includes some feature engineering, namely word stemming and stop word removal. Words are converted to their stem form (for example, 'likes' becomes 'like'), and stop words such as 'in', 'on', 'over', and 'therefore' are ignored because they say little about the topic, which improves efficiency. Training itself is done with TensorFlow, which turns the data into vectorized form. The key algorithm is a convolutional neural network with pooling applied twice; the final accuracy is over 50%. After that, related news can be recommended to users based on the labeled topic.
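The stemming and stop word step described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the stop word list is a small hypothetical sample, and `naive_stem` is a toy suffix stripper standing in for a real stemmer such as a Porter stemmer.

```python
import re

# Assumption: a tiny illustrative stop word list, not the list used in Tap-News.
STOP_WORDS = {"in", "on", "over", "therefore", "the", "a", "an", "of", "to", "and"}


def naive_stem(word):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title):
    """Lowercase a title, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("He likes walking in the park"))
```

The resulting token lists would then be mapped to numeric vectors before being fed to the classifier. A real stemmer handles many cases this toy version gets wrong (for example, 'watches' or 'running').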
Monitor & Fetch & Deduper
I monitor major news sources such as BBC, CNN, and Bloomberg through the newspaper API. When recent news comes out, the monitor sends a notification message through CloudAMQP. Once the application is notified, it fetches the recent news based on the title and then notifies the news deduper, which eliminates the same news arriving from different sources. This improves the user experience, because users do not want to read repeated news.
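To make the deduplication idea concrete, here is a dependency-free sketch of comparing articles by TF-IDF cosine similarity. It is a pure-Python stand-in for sklearn's TfidfVectorizer, not the project's code. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumption: similarity at or above this value means "same news".
SAME_NEWS_THRESHOLD = 0.8


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (as a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def is_duplicate(new_text, existing_texts, threshold=SAME_NEWS_THRESHOLD):
    """True if new_text is near-identical to any text in existing_texts."""
    if not existing_texts:
        return False
    vectors = tfidf_vectors([new_text] + list(existing_texts))
    return any(cosine(vectors[0], v) >= threshold for v in vectors[1:])
```

In practice the deduper would compare a newly fetched article only against recent articles (for example, those published the same day) to keep the comparison cheap. That windowing is also an assumption here, not something stated above.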
Click log & user preference using time decay model
I preserve each user's click log to analyze their news preferences, and based on that I can recommend the news they like to read, since the logs show which topics attract a user's clicks. I also apply a time decay model, because even if a user loves certain topics, it is not wise to recommend news based on interests from a long time ago. The time decay model is based on the following update:
if selected: p = (1 - a) * p + a
if not selected: p = (1 - a) * p
where p is the probability of recommending the news and a is a coefficient that depends on the time passed.
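The update rule can be sketched in a few lines. This is an illustration under assumptions: preferences are stored as a dict from topic to probability, one update is applied per click, and the coefficient (called `alpha` here) is passed in as a parameter rather than derived from elapsed time.

```python
def update_preferences(prefs, selected_topic, alpha):
    """Apply one time decay step after the user clicks on selected_topic.

    prefs: dict mapping topic name to probability p (hypothetical layout).
    alpha: the coefficient a from the formula, with 0 <= alpha <= 1.
    """
    updated = {}
    for topic, p in prefs.items():
        if topic == selected_topic:
            updated[topic] = (1 - alpha) * p + alpha  # selected topic
        else:
            updated[topic] = (1 - alpha) * p          # every other topic
    return updated


prefs = {"politics": 0.5, "crime": 0.5}
prefs = update_preferences(prefs, "crime", alpha=0.2)
print(prefs)
```

One useful property of this rule: if the probabilities sum to 1 before an update, they still sum to 1 afterwards, because the total becomes (1 - a) * 1 + a. Older clicks fade geometrically as new clicks arrive, so recent interests dominate.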