Professional Data Scientist Network

Machine Learning


Recommender

Enables flexible, scalable personalization for increased customer engagement and lifetime value.

A recommender is a type of machine learning algorithm that enables personalization. The technique is also known as collaborative filtering. Given user-item interaction data, recommendation systems can recommend new items to users based on their past interactions. A recommender system can also detect similar items or similar users: which items draw the same crowd, and which groups of users like the same items. Personalized recommenders are widely used on movie rental and music sharing sites, as well as by retail and news outlets. Good recommenders are proven to increase user interaction and drive revenue.
  • Factorization machine
  • Matrix factorization
  • Item similarity (also known as neighborhood-based collaborative filtering)
  • Popularity-based recommender
  • Accept explicit ratings as well as implicit interaction data
  • Optimize for either ranking or rating prediction
  • Allow side information for users and items via the factorization machine
  • Multiple solvers for matrix factorization, such as alternating least squares (ALS), implicit alternating least squares (IALS), stochastic gradient descent (SGD), and AdaGrad
  • Binary matrix factorization
  • Non-negative matrix factorization
  • Sample unobserved data to optimize for ranking
  • Find similar items
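
As a rough illustration of the item similarity approach listed above, the sketch below scores items by how much their audiences overlap (Jaccard similarity) and recommends unseen items to a user. It uses only the Python standard library; the interaction data and the function names (build_item_users, similar_items, recommend) are hypothetical and are not this product's API.

```python
from collections import defaultdict

# Hypothetical implicit interaction data: (user, item) pairs.
interactions = [
    ("ann", "matrix"), ("ann", "inception"), ("ann", "alien"),
    ("bob", "matrix"), ("bob", "inception"),
    ("cat", "alien"), ("cat", "notebook"),
    ("dan", "notebook"),
]

def build_item_users(pairs):
    """Map each item to the set of users who interacted with it."""
    item_users = defaultdict(set)
    for user, item in pairs:
        item_users[item].add(user)
    return item_users

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_items(item_users, item, k=2):
    """Return the k items whose audiences overlap most with `item`."""
    scores = [
        (other, jaccard(item_users[item], users))
        for other, users in item_users.items()
        if other != item
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores[:k]

def recommend(pairs, user, k=2):
    """Score unseen items by summed similarity to the items the user has seen."""
    item_users = build_item_users(pairs)
    seen = {item for u, item in pairs if u == user}
    scores = defaultdict(float)
    for seen_item in seen:
        for other, users in item_users.items():
            if other not in seen:
                scores[other] += jaccard(item_users[seen_item], users)
    ranked = sorted(scores.items(), key=lambda s: (-s[1], s[0]))
    return [item for item, score in ranked if score > 0][:k]

print(similar_items(build_item_users(interactions), "matrix", k=1))
print(recommend(interactions, "bob"))
```

Jaccard similarity suits implicit data because it needs only the presence of an interaction; with explicit ratings, a measure such as cosine or Pearson similarity over the rating values would be the more usual choice.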


Classification

Classification is the problem of making a discrete prediction using training data. The key difference between regression and classification is that in regression the target is continuous, while in classification the target is categorical. A classifier can be used for several applications including:
  • Fraud Detection
  • Outlier Detection
  • Personalization
  • Click-through rate prediction
  • Churn Prediction
  • Object recognition from Images
  • ...and many more uses.
Classification is one of the most fundamental tasks in machine learning, and it powers many of the world’s most innovative intelligent applications.
  • Logistic regression
  • Nearest neighbor classifier
  • Support vector machines (SVM)
  • Boosted Decision Trees
  • Random Forests
  • Neural network classifier (deep learning)
Works with all your data
All models can incorporate the following set of rich feature types:
  • Numeric features 
  • Categorical features
  • Sparse features (i.e., feature sets that have a large number of features, of which only a small subset have non-zero values)
  • Dense features (i.e., feature sets with a large number of numeric features)
  • Text data
  • Images
In addition to these feature types, models such as neural networks, boosted decision trees, and random forests can be used for feature extraction.
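
To make the mix of numeric, categorical, and sparse features concrete, here is a minimal sketch of logistic regression trained with stochastic gradient descent over sparse feature dictionaries. It is a standard-library illustration only: the toy churn-style data, the column names (plan, calls), and the helper names (one_hot, predict_proba, train) are assumptions, not part of any library described here.

```python
import math

def one_hot(row):
    """Turn a record into a sparse feature dict: numbers keep their value,
    strings become indicator features such as 'plan=basic'."""
    features = {"bias": 1.0}
    for name, value in row.items():
        if isinstance(value, str):
            features[f"{name}={value}"] = 1.0
        else:
            features[name] = float(value)
    return features

def predict_proba(weights, features):
    """Probability of the positive class under the logistic model."""
    z = sum(weights.get(key, 0.0) * value for key, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, epochs=200, learning_rate=0.5):
    """Plain SGD on the log loss; only features present in a row are updated."""
    weights = {}
    data = [one_hot(row) for row in rows]
    for _ in range(epochs):
        for features, label in zip(data, labels):
            error = predict_proba(weights, features) - label
            for key, value in features.items():
                weights[key] = weights.get(key, 0.0) - learning_rate * error * value
    return weights

# Hypothetical training data: 1 means the customer churned.
rows = [
    {"plan": "basic", "calls": 0.9},
    {"plan": "premium", "calls": 0.1},
    {"plan": "basic", "calls": 0.7},
    {"plan": "premium", "calls": 0.2},
    {"plan": "basic", "calls": 0.8},
    {"plan": "premium", "calls": 0.3},
]
labels = [1, 0, 1, 0, 1, 0]

weights = train(rows, labels)
print(predict_proba(weights, one_hot({"plan": "basic", "calls": 0.85})))
print(predict_proba(weights, one_hot({"plan": "premium", "calls": 0.15})))
```

Storing features as a dictionary means a record only pays for the features it actually has, which is what makes very wide, mostly-zero feature sets practical.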

Clustering

Groups data points into clusters that are similar to each other.

Clustering is the fundamental machine learning task of separating data into similar groups where there aren't nice class labels in a training dataset. Clustering is often done in the exploratory data analysis phase to get a better intuition about the structure of a dataset, or as a preliminary step for more complicated models. There are countless clustering algorithms, but very few implementations scale well to the size of modern datasets. 

Our clustering tools work with the highly efficient and optimized SFrame tabular data structure, which means they efficiently scale to very large datasets. 
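
The text does not name a specific clustering algorithm, so the following sketch uses k-means, a common choice, purely as an illustration. It runs on plain Python lists rather than SFrame, and the deterministic initialization (the first k points become the starting centroids) is a simplification chosen for reproducibility.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Minimal k-means: alternate between assigning points and moving centroids."""
    # Simplified, deterministic start: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Each iteration costs time proportional to the number of points times k, which is why k-means is a frequent starting point on large datasets; production implementations typically add smarter initialization and a convergence check.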

Data Matching

Enables aggregation and linking of arbitrary data from different sources for the purposes of deduplication and organizing content from large collections of unstructured data for easier navigation and more thorough understanding.

  • Auto-Tagging, which is matching unstructured text records to a pre-existing set of tags. For example, when building a news aggregation service, this algorithm can be used to automatically categorize content into a set of topics.
  • Deduplication, which reduces data duplication by identifying records in tabular datasets that correspond to the same entity. This is helpful in cleaning and merging similar data from different datasets when the data may contain input errors, come from multiple sources, etc.
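
The deduplication idea above can be sketched with fuzzy string matching from the Python standard library (difflib.SequenceMatcher). This is an illustration under assumptions, not the product's algorithm: the records, the 0.8 threshold, and the greedy grouping strategy are all hypothetical choices.

```python
from difflib import SequenceMatcher

def normalize(record):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(record).lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def deduplicate(records, threshold=0.8):
    """Greedy grouping: a record joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            groups.append([record])
    return groups

# Hypothetical records from different sources describing two entities.
records = [
    "Acme Corporation, 12 Main St",
    "ACME Corporation, 12 Main St.",
    "Globex Inc, 400 Oak Ave",
    "acme corporation,  12 main st",
    "Globex Inc., 400 Oak Avenue",
]

for group in deduplicate(records):
    print(group)
```

Comparing every record against every group is quadratic in the worst case, so large datasets usually need a blocking or nearest-neighbor step to limit which pairs are compared.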

Deep Learning

Provides users with state-of-the-art algorithms for classification. The resulting trained models can also transform input into features that are useful in the context of other machine learning tasks such as regression, clustering, or finding nearest neighbors.

  • The training of a powerful classifier on image or numeric data. Deep neural networks are the state of the art in many machine learning tasks, especially in the context of images.
  • Deep Features: The use of a pre-trained model to transform input into features which are rich and meaningful. These features can then be used in the context of other supervised or unsupervised machine learning tasks. This technique allows one to capture the power of Deep Learning without having to have the expertise or time to train it from scratch.
With a GPU, training deep networks on millions of images becomes feasible, with throughput rates of ~200 images/second on a single GPU. Multiple GPUs allow for nearly linear scaling of throughput, so you can iterate and experiment with different architectures and parameters more quickly.
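
To put the quoted throughput in perspective, the small calculation below estimates how long one pass over a dataset would take. Only the ~200 images/second figure comes from the text above; the dataset size, the GPU count, and the scaling_efficiency parameter are hypothetical values chosen for illustration.

```python
def epoch_hours(num_images, images_per_second=200.0, gpus=1, scaling_efficiency=1.0):
    """Estimate hours for one pass over the data.

    scaling_efficiency is a hypothetical knob: 1.0 means each extra GPU adds
    a full GPU's worth of throughput (perfectly linear scaling).
    """
    throughput = images_per_second * (1 + (gpus - 1) * scaling_efficiency)
    return num_images / throughput / 3600.0

# One million images on a single GPU: 1,000,000 / 200 = 5,000 seconds.
single = epoch_hours(1_000_000)
# Four GPUs at an assumed 90% scaling efficiency.
multi = epoch_hours(1_000_000, gpus=4, scaling_efficiency=0.9)
print(f"1 GPU: {single:.2f} h, 4 GPUs: {multi:.2f} h")
```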

Nearest Neighbors

Provides the most flexible way to search for data points that are similar to a set of queries, and does so at interactive speed.

Finding the nearest neighbors of a set of query data points is a core component of many machine learning algorithms. The nearest neighbor classifier, for example, predicts the label for a query point based on the labels of the closest points in the training set. "Composite distance" functions allow comparisons between data points with any mix of data types, like numeric values and free text.
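
A minimal sketch of a composite distance and a nearest neighbor classifier is shown below, assuming a weighted sum of Euclidean distance on the numeric fields and Jaccard distance on the words of a free-text field. The record layout ("numbers", "text", "label"), the weights, and the product data are hypothetical; a brute-force scan like this would not reach interactive speed on large data without an index.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(text_a, text_b):
    """1 minus the Jaccard similarity of the two texts' word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def composite_distance(p, q, numeric_weight=1.0, text_weight=1.0):
    """Weighted sum of per-type distances over mixed-type records."""
    return (numeric_weight * euclidean(p["numbers"], q["numbers"])
            + text_weight * jaccard_distance(p["text"], q["text"]))

def knn_classify(query, training, k=3):
    """Predict the majority label among the k closest training records."""
    neighbors = sorted(training, key=lambda row: composite_distance(query, row))[:k]
    votes = Counter(row["label"] for row in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical product records: numbers are [weight in kg, screen size / 10].
training = [
    {"numbers": [1.4, 1.3], "text": "lightweight laptop with backlit keyboard", "label": "laptop"},
    {"numbers": [2.1, 1.5], "text": "gaming laptop with fast graphics", "label": "laptop"},
    {"numbers": [1.8, 1.4], "text": "business laptop long battery", "label": "laptop"},
    {"numbers": [0.20, 0.60], "text": "compact phone with great camera", "label": "phone"},
    {"numbers": [0.18, 0.65], "text": "phone with long battery", "label": "phone"},
    {"numbers": [0.21, 0.61], "text": "budget phone dual sim", "label": "phone"},
]

query = {"numbers": [0.19, 0.62], "text": "phone with good camera"}
print(knn_classify(query, training))
```

The weights let you decide how much each data type contributes; numeric fields usually need scaling first so that one large-valued column does not dominate the sum.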

Text Analysis

Text analysis, text classification, natural language processing, sentiment analysis, semantic analysis.

  • Feature engineering: Raw, unstructured text is often transformed into a variety of numeric representations so that it can be used as input for machine learning models. For example, we make it easy to create a bag-of-words representation of your data, where we compute the number of times each unique word occurred. We also support similar functionality for general n-grams, either for words or characters.
  • Transformations: In conjunction with the feature_engineering toolkit, it's straightforward to reweight these values via TF-IDF or BM25 (which can be especially useful for ranking documents by relevance).
  • Modeling: We provide scalable implementations of topic models, such as latent Dirichlet allocation (LDA), that can cluster documents according to topic.
  • Tokenizing: Makes it easy to tokenize sentences into words and use this as input to other Python tools that are available, such as NLTK, gensim, and so on.
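
The feature engineering and transformation steps above can be sketched in a few lines of standard-library Python. The function names, the simple regular-expression tokenizer, and the sample documents are hypothetical, and the TF-IDF weighting shown (raw count times log(N / document frequency)) is one common variant rather than necessarily the one the toolkit uses.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count how many times each unique word occurs."""
    return Counter(tokenize(text))

def word_ngrams(text, n=2):
    """Count contiguous runs of n words."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tf_idf(documents):
    """Reweight word counts so words found in many documents count for less."""
    bags = [bag_of_words(doc) for doc in documents]
    doc_freq = Counter(word for bag in bags for word in bag)
    n_docs = len(documents)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bag.items()}
        for bag in bags
    ]

# Hypothetical documents.
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The bird sang",
]

print(bag_of_words(documents[0]))
print(word_ngrams(documents[0], n=2))
print(tf_idf(documents)[0])
```

In this variant a word that appears in every document, such as "the" here, receives a weight of zero, while a word unique to one document keeps the largest weight.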

Use Cases

Recommendation Engine
Recommendation engines can increase customer engagement, which generally results in higher customer lifetime value. Recommendations can be based either on user preferences or on detailed categorization of the items or content.
Recommendation engines are often used for:
  • Movies
  • Music
  • News
  • People
  • Products
  • Real estate
  • Recipes
  • Videos
Recommendation engines are a commonly referenced application of big data analytics. We encounter these systems at retailers who show us products typically purchased together or at online services that curate books and music for us based on our preferences. Recommendation engines have significant proven ROI because they consistently boost sales as well as customer satisfaction.

Customer Churn
React to early indicators of customer dissatisfaction to reduce loss of users and recurring revenue with customer churn models.

Predicting customer “churn” (when a customer will leave a provider of a product or service in favor of another) is a valuable application for machine learning. Predicting this move requires establishing correlations across a wide variety of data, including communication types and frequency, that might signal preferences and, ultimately, intent. Churn prediction is particularly important in the telecommunications industry, where a small number of mobile service providers must compete for a relatively finite customer base. The analysis has to be conducted frequently to ensure continued customer satisfaction through improved customer service and targeted offers. It also has to be applied to a large and varied body of data, which grows exponentially and in lockstep with the size of each subscriber’s social network.

Customer Segmentation
Personalize your customer interactions and increase customer loyalty with customer segmentation machine learning models.

Data science and, by extension, analytics are transforming marketing into a highly targeted, contextual activity that aims to match the product to buyer needs and wants. By analyzing customer purchase histories and patterns of interaction with the product and service, sellers can not only refine their offers to well-known market segments but also identify entirely new segments whose preferences were previously hidden in mounds of data.

Fraud Detection
Identify and prevent illegal financial activity faster with fraud detection machine learning models.

Machine learning holds a great deal of promise for the area of fraud detection analytics (FDA). While it’s still early days for the discipline, it is well understood that analysis of financial transactions, email, customer relationships, and communications can help identify fraudulent activity and even predict it before it occurs, saving financial services firms millions in lost revenue.

Sentiment Analysis
Understand customer preferences, feedback, and intent more accurately with sentiment analysis machine learning models.

Sentiment analysis essentially amounts to text and document classification, but the label is sentiment rather than topic, as in a positive review or a secretive tone. Sentiment analysis is complex because looking for keywords is often not enough to infer the sentiment of a sentence or a string of words. This makes sentiment analysis a big data problem that can benefit from machine learning models. Accurate sentiment identification has a myriad of applications, from making recommendations more accurate to sorting documents, books, and reviews for easy classification. Identifying a person’s intent can also help in fraud detection, churn prediction, and many other business intelligence applications.

We provide high performance algorithms for:

  • Recommenders
  • Data matching
  • Deep learning
  • Sentiment analysis
  • Churn prediction
  • Personalization
  • Object recognition
  • Topic modeling
  • Classification
  • Clustering
  • Regression
  • Graph analytics
  • Neural networks
  • Matrix factorization
  • Image processing
  • Text analytics