Machine Learning with Big Data: Introduction


In a typical machine learning process, the first step is usually an exploratory analysis of the data to gain insights.
This is followed by pre-processing. Common pre-processing steps include converting categorical attributes to numerical values, handling null values, and transforming features to reduce skewness in their distributions.
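As a rough illustration of those three steps, here is a minimal in-memory sketch in plain Python. The records, the column names (`city`, `income`), and the choice of median imputation and a `log1p` transform are assumptions made for the example, not a prescribed recipe.

```python
import math
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"city": "Pune", "income": 30000.0},
    {"city": "Delhi", "income": None},
    {"city": "Pune", "income": 900000.0},
    {"city": "Mumbai", "income": 45000.0},
]

# 1. Categorical -> numerical: map each distinct category to an integer index.
categories = sorted({r["city"] for r in rows})
city_index = {name: i for i, name in enumerate(categories)}

# 2. Null handling: impute missing incomes with the median of the observed values.
observed = [r["income"] for r in rows if r["income"] is not None]
income_median = median(observed)

# 3. Skewness: log1p compresses the long right tail of the income column.
prepared = []
for r in rows:
    income = r["income"] if r["income"] is not None else income_median
    prepared.append(
        {"city_idx": city_index[r["city"]], "log_income": math.log1p(income)}
    )

for row in prepared:
    print(row)
```

Other choices (one-hot encoding, dropping rows with nulls, a different transform) are just as common; which one fits depends on the data and the algorithm.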

However, machine learning with Spark MLlib requires some specific pre-processing, depending on the type of task:

  • In the case of most classification and regression algorithms, you want to get your data into a column of type Double to represent the label and a column of type Vector (either dense or sparse) to represent the features.
  • In the case of recommendation, you want to get your data into a column of users, a column of items (say movies or books), and a column of ratings.
  • In the case of unsupervised learning, a column of type Vector (either dense or sparse) is needed to represent the features.
  • In the case of graph analytics, you will want a DataFrame of vertices and a DataFrame of edges.

In this series, we will look at how to perform each of these steps in PySpark. We will also look at several options for operationalizing the ML model in Databricks.
