# ML terminology, demystified

### Deep Learning (DL)

The use of large (or "deep") neural networks. The primary advantage is that the many layers carry out automated feature extraction, in contrast to rules-based feature extraction or feature engineering (research into pipelines that extract the optimal features).

### Artificial Intelligence (AI)

I use this term to mean artificial general intelligence (machines that write their own programs from scratch, based on a comprehension of the task at hand). This doesn't exist in the Earth and Life sciences yet! In my opinion, conflating ML with AI is confusing and counter-productive, and it leaves no room for growth.

### Model

A model of the data, if you will, or the way that data generates labels, to be more specific. The representation of what a machine learning system has learned from the training data.

Within TensorFlow, and machine learning more widely, the term `model` is overloaded:

- The model technique/class or the specific Python object (such as a TensorFlow graph) that expresses the structure of how a prediction will be computed.
- The particular parameters (e.g. weights and biases) of that model, which are determined by training.

### Regularization

The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include:

- L2 regularization: the penalty is the sum of the squared weight magnitudes, which shrinks large weights towards zero
- L1 regularization: the penalty is the sum of the absolute weight magnitudes, which tends to drive some weights to exactly zero
- dropout regularization
- early stopping (this is not a formal regularization method, but can effectively limit overfitting)
- data augmentation (again, not a formal regularization method)
- batch normalization (see below)
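As a rough illustration of the L1 and L2 penalties listed above, the sketch below computes both for a small weight vector. The function names and the strength `lam` are hypothetical, chosen only for this example; in practice the penalty is added to the data loss during training.

```python
def l1_penalty(weights, lam):
    """Sum of absolute weight magnitudes, scaled by lam."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Sum of squared weight magnitudes, scaled by lam."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]
lam = 0.1  # hypothetical regularization strength

# Larger weights cost more under both penalties; the squared (L2)
# penalty punishes the single large weight (-2.0) most heavily.
print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))
```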

### Data augmentation

The process of creating transformed versions of the data. This process often, but not always, duplicates data. Its primary purpose is regularization: by providing alternative versions of the data, it helps the model learn feature representations that are invariant to translation (location), scale, etc. In ML Mondays we use augmentation in a few different ways: in Part 1 we create an augmentation pipeline that augments imagery without duplicating it, and in Part 3 we create augmented imagery and write it to disk.
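A minimal sketch of the two styles of augmentation described above, using a tiny image stored as a list of rows. The helper names (`flip_left_right`, `random_flip`) are hypothetical and are not the functions used in the course pipelines.

```python
import random


def flip_left_right(image):
    """Mirror a 2-D image (a list of rows) horizontally."""
    return [row[::-1] for row in image]


def random_flip(image, rng):
    """Flip with probability 0.5; otherwise return the image unchanged."""
    return flip_left_right(image) if rng.random() < 0.5 else image


image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)  # seeded so the run is repeatable

# Augmentation without duplication: the model sees one (possibly
# transformed) version of each image, so the dataset size is unchanged.
augmented_on_the_fly = random_flip(image, rng)

# Augmentation with duplication: original and transformed copies are
# both kept, so the dataset grows.
augmented_dataset = [image, flip_left_right(image)]
```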

### Batch Normalization

Normalizing the input or output of the activation functions in a hidden layer. The normalization is slightly different each time, because it is dictated by the distribution of values in the batch. Its effects therefore depend in part on the size of the batch. Batch normalization can provide the following benefits:

- Make neural networks more stable by protecting against outlier weights.
- Enable higher learning rates, which in turn promote faster training.
- Reduce overfitting.
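The batch dependence described above can be sketched as follows. This is a simplified version that only standardizes a batch of scalar activations; it assumes no learned scale and shift parameters and no moving averages, both of which real batch normalization layers typically add. The function name and `eps` value are illustrative.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize a batch of scalar activations to zero mean, unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero when all values are equal.
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


batch_a = [1.0, 2.0, 3.0]
batch_b = [1.0, 2.0, 9.0]

# The same activation (2.0) is normalized differently in each batch,
# because the batch statistics differ.
print(batch_normalize(batch_a)[1])
print(batch_normalize(batch_b)[1])
```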

### Dropout Regularization

A form of regularization useful in training neural networks. Dropout regularization works by removing (actually, zeroing out temporarily) a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization.
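A minimal sketch of a single dropout step, following the description above of zeroing a fixed number of randomly chosen units. Rescaling the surviving units (so-called inverted dropout) is an assumption added here and is not stated in the text; `n_drop` and `rng` are hypothetical names.

```python
import random


def dropout(activations, n_drop, rng):
    """Zero out n_drop randomly chosen units and rescale the rest."""
    n = len(activations)
    dropped = set(rng.sample(range(n), n_drop))
    keep_fraction = (n - n_drop) / n
    # Assumption: surviving units are scaled up so the expected sum of
    # the layer's activations stays roughly the same.
    return [0.0 if i in dropped else a / keep_fraction
            for i, a in enumerate(activations)]


activations = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(42)

# A new random selection is drawn for every gradient step; the units
# are only zeroed temporarily, so the original list is left untouched.
step_1 = dropout(activations, n_drop=2, rng=rng)
step_2 = dropout(activations, n_drop=2, rng=rng)
print(step_1)
print(step_2)
```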

### Early Stopping

A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation dataset starts to increase, that is, when generalization performance worsens.
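One common way to implement this is with a "patience" counter, sketched below. The function name, the `patience` parameter, and the loss values are hypothetical; the validation losses are made up to illustrate the logic, not taken from a real training run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # validation loss has stopped improving
    return len(val_losses) - 1  # never triggered; ran all epochs


# Illustrative validation losses: improving, then getting worse.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7]
stop_at = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch index", stop_at)
```

Here the best validation loss occurs at epoch index 2, and training stops at index 4 after two epochs without improvement.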

### Epoch

A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N/batch size training iterations, where N is the total number of examples.
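The arithmetic can be sketched as below. The `drop_remainder` flag is a hypothetical parameter covering the case where N is not an exact multiple of the batch size and the final partial batch is either kept or discarded.

```python
import math


def iterations_per_epoch(n_examples, batch_size, drop_remainder=False):
    """Number of training iterations needed to see every example once."""
    if drop_remainder:
        # The final partial batch is discarded.
        return n_examples // batch_size
    # The final partial batch counts as one extra iteration.
    return math.ceil(n_examples / batch_size)


print(iterations_per_epoch(1000, 32))
print(iterations_per_epoch(1000, 32, drop_remainder=True))
```

With 1000 examples and a batch size of 32, this gives 32 iterations per epoch, or 31 if the final partial batch of 8 examples is dropped.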

### Convergence

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.

### Convolutional Neural Network

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some alternating combination of

- convolutional layers
- pooling layers

followed by

- dense (or fully connected) layers

### Cross-Entropy

A measure of the difference between two probability distributions. A generalization of Log Loss to multi-class classification problems.
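A minimal sketch of the calculation for a single example, assuming the true label is given as a one-hot probability vector. The function name and the clipping constant `eps` are illustrative choices, not part of any particular library.

```python
import math


def cross_entropy(true_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a true and a predicted distribution."""
    # Predictions are clipped at eps to avoid taking log(0).
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_probs, pred_probs))


# Three-class example: the true class is the second one.
true_probs = [0.0, 1.0, 0.0]
pred_probs = [0.1, 0.8, 0.1]
print(cross_entropy(true_probs, pred_probs))

# With two classes this reduces to Log Loss for a binary classifier.
print(cross_entropy([1.0, 0.0], [0.8, 0.2]))
```

Only the predicted probability of the true class contributes when the label is one-hot, so both calls above evaluate to `-log(0.8)`.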

### Discriminative model

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

### Hyperparameter

The "knobs" that you can tweak during successive runs of training a model. For example, learning rate is a hyperparameter. Contrast with a parameter, which is learned by the model during training.

### Loss (or cost)

A measure of how far a model's predictions are from the corresponding labels. Or, a measure of how bad the model is. To determine this value, a model must define a loss function.

### one-vs.-all

Given a classification problem with N possible classes, a one-vs.-all solution consists of N separate binary classifiers, one for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

- animal vs. not animal
- vegetable vs. not vegetable
- mineral vs. not mineral

This is the approach we take in segmentation.
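As a sketch of how the training targets for such classifiers might be built, the snippet below turns one list of class labels into one binary target vector per class. The function name and the example labels are hypothetical.

```python
def one_vs_all_targets(labels, classes):
    """Build one binary (1/0) target vector per class."""
    return {c: [1 if label == c else 0 for label in labels]
            for c in classes}


classes = ["animal", "vegetable", "mineral"]
labels = ["animal", "mineral", "animal", "vegetable"]

targets = one_vs_all_targets(labels, classes)
for name, vector in targets.items():
    # Each vector is the target for a "<class> vs. not <class>" classifier.
    print(name, vector)
```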

### Optimizer ("numerical solver")

A specific implementation of the gradient descent algorithm. In TensorFlow 1.x the base class for optimizers is `tf.train.Optimizer`; in TensorFlow 2 the Keras optimizers derive from `tf.keras.optimizers.Optimizer`. Popular optimizers include:

- AdaGrad, which stands for ADAptive GRADient descent.
- Adam, which stands for ADAptive Moment estimation (note that `RMSprop` is, roughly speaking, Adam without momentum).
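For orientation, the plain gradient descent update that these optimizers build on can be sketched in a few lines. This is not AdaGrad or Adam, only the basic fixed-step rule; the function name, the toy objective `(w - 3)^2`, and the learning rate are all illustrative assumptions.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Adaptive optimizers differ mainly in how they adjust the effective step size for each parameter as training proceeds.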

### Supervised machine learning

Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning.

### Semi-supervised learning

Training a model on data where some of the training examples have labels but others don’t. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

### Unsupervised learning

Training a model to discover both the classes, and which class each sample represents. The 'holy grail' of Machine Learning, in many ways.

### Transfer learning

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.

### Tensor

The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.
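To make the idea of dimensionality concrete, the sketch below uses nested Python lists as a stand-in for tensors and reports their shape. The `shape` helper is hypothetical and assumes regular (non-ragged) nesting; real TensorFlow tensors expose this information through their `.shape` attribute.

```python
def shape(x):
    """Return the shape of a regularly nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        if len(x) == 0:
            break
        x = x[0]  # assumes every element at this level has the same shape
    return tuple(dims)


scalar = 3.0                       # 0 dimensions
vector = [1.0, 2.0, 3.0]           # 1 dimension
matrix = [[1, 2, 3], [4, 5, 6]]    # 2 dimensions
image_batch = [[[[0] * 3] * 4] * 5] * 2  # 4 dimensions

for t in (scalar, vector, matrix, image_batch):
    print(len(shape(t)), shape(t))
```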

### GPU "Determinism"

Training a deep learning model involves many random parts, such as batching, augmentation, and regularization. In addition, when you run operations on several parallel threads, you typically do not know which thread will finish first. This does not matter when threads operate on their own data, so, for example, applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when you compute a sum, the result may depend on the order of the summation, because floating-point addition is not associative.
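The order dependence of a sum can be seen even on a CPU with ordinary Python floats. The sketch below is only an illustration of the underlying arithmetic, not of GPU behavior itself.

```python
values = [0.1, 0.2, 0.3]

# The same three numbers, added in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The two results differ in the last bits, because each intermediate
# sum is rounded to the nearest representable floating-point number.
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

When many threads contribute to one sum in an unpredictable order, tiny differences like this can accumulate and make two otherwise identical training runs diverge.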

See this ML Mondays blog post for what we are doing in this course to counter these effects.

### "Eager execution"

A TensorFlow programming environment in which operations run immediately. By contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

### "Graph execution"

The opposite of eager execution. A TensorFlow programming environment in which the program first constructs a graph and then executes all or part of that graph. Graph execution was the default execution mode in TensorFlow 1.x. A give-away is the use of `tf.Session`.

Adapted from entries in https://developers.google.com/machine-learning/glossary