# Pooling, Normalizations, and Custom layers

Resources this week

1. [Ian Goodfellow on BatchNorm](https://www.youtube.com/watch?v=Xogn6veSyxA&t=325s)
2. Slide decks [5](http://cs231n.stanford.edu/slides/2022/lecture_5_ruohan.pdf), [6](http://cs231n.stanford.edu/slides/2022/lecture_6_jiajun.pdf) and [7](http://cs231n.stanford.edu/slides/2022/lecture_7_ruohan.pdf) from Stanford CS231n
3. [Practical Recommendations for Gradient-Based Training of Deep
Architectures](https://arxiv.org/pdf/1206.5533v2.pdf) by Yoshua Bengio

The Stanford slide decks are very detailed and they *will* get you up to speed on many aspects of CNNs, as well as provide practical recommendations on how to train a network effectively. We wouldn't be able to do a better job of explaining these details, so we are simply going to provide the links. We highly recommend that you go through them at some point.

In [None]:
import tensorflow as tf
import numpy as np

# Pooling

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting.

TL;DR: Pooling is a form of downsampling (non-linear in the case of max pooling) that reduces memory usage and adds local translation invariance.

There are two main types of pooling: average pooling and max pooling. Both partition the input image into a set of rectangles. For each such sub-region, average pooling outputs the mean of all the values in the sub-region, and max pooling outputs the highest value in the sub-region.
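
To make the partitioning concrete, here is a minimal NumPy sketch of 2x2 pooling with stride 2 on a hand-written 4x4 array. The array values and the reshape trick are assumptions made for illustration, not how Keras implements pooling internally, and the trick only works when the height and width are divisible by the window size.

```python
import numpy as np

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 8, 1, 0],
              [7, 6, 3, 2]], dtype=np.float32)

# Split the 4x4 array into non-overlapping 2x2 blocks.
# Axes after the reshape: (block_row, row_in_block, block_col, col_in_block).
blocks = x.reshape(2, 2, 2, 2)

max_pooled = blocks.max(axis=(1, 3))   # highest value in each block: [[4, 8], [9, 3]]
avg_pooled = blocks.mean(axis=(1, 3))  # mean of each block: [[2.5, 6.5], [7.5, 1.5]]

# 16 activations go in, 4 come out: a 4x reduction per pooling layer.
reduction = x.size // max_pooled.size
print(max_pooled, avg_pooled, reduction, sep="\n")
```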

![maxpool](https://www.researchgate.net/publication/340812216/figure/fig4/AS:928590380138496@1598404607456/Pooling-layer-operation-oproaches-1-Pooling-layers-For-the-function-of-decreasing-the.png)




In [None]:
# create a tensor with small ints and convert to float
A_np = np.random.randint(0, 9, size=(4,4))
A = tf.constant(A_np, tf.float32)
print(A_np)
# apply 2x2 max pooling
maxpooled = tf.keras.layers.MaxPooling2D((2,2))(A) # will get an error, why?

In [None]:
A = tf.reshape(A, (1,4,4,1))
maxpooled = tf.keras.layers.MaxPooling2D((2,2))(A)
tf.reshape(maxpooled, (2,2)) # for better printing
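
The reshape is needed because `MaxPooling2D` expects a 4D input laid out as `(batch, height, width, channels)`, which is the Keras default (`channels_last`). As a small sketch, the same two axes can be added with NumPy indexing. The array `img` is a made-up stand-in for `A_np`.

```python
import numpy as np

img = np.arange(16, dtype=np.float32).reshape(4, 4)  # a single-channel 4x4 "image"

# Add a leading batch axis and a trailing channel axis.
batched = img[np.newaxis, :, :, np.newaxis]
print(batched.shape)  # (1, 4, 4, 1)
```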

In [None]:
avgpooled = tf.keras.layers.AveragePooling2D((2,2))(A)
tf.reshape(avgpooled, (2,2)) # for better printing

Let's try to simulate what happens if the activations shift one pixel to the right. We will shift all values one column to the right and zero-pad the leftmost column.

In [None]:
pixel_shift = np.concatenate((np.zeros((4,1)), A_np[:,:3]), axis=1)
print(pixel_shift)
shifted = tf.constant(pixel_shift, tf.float32)
shifted = tf.reshape(shifted, (1,4,4,1))
maxpooled = tf.keras.layers.MaxPooling2D((2,2))(shifted)
tf.reshape(maxpooled, (2,2)) # for better printing
# tf.reshape(pixel_shift, (4,4)) # for better printing

My guess is that the result will be very similar to the non-shifted version. Over a large image, the changes are very minor. Additionally, a large network typically has many max pooling layers. So by using max pooling, we can account for small shifts. This is called local translation invariance.
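
Here is a minimal NumPy sketch of why the invariance is only *local*, using a single active pixel on a zero background. The helper names `max_pool_2x2` and `shift_right` are made up for this illustration. A shift that stays inside one pooling window leaves the max-pooled output unchanged, while a shift that crosses a window boundary changes it.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def shift_right(x):
    """Shift every column one pixel to the right, zero-padding on the left."""
    return np.concatenate((np.zeros((x.shape[0], 1)), x[:, :-1]), axis=1)

x = np.zeros((4, 4))
x[0, 0] = 1.0                 # one active feature in the top-left pooling window

inside = shift_right(x)       # feature moves to (0, 1): still the same window
across = shift_right(inside)  # feature moves to (0, 2): now in the next window

same_window = np.array_equal(max_pool_2x2(x), max_pool_2x2(inside))
crossed_window = np.array_equal(max_pool_2x2(x), max_pool_2x2(across))
print(same_window, crossed_window)  # True False
```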

## ⏰Exercise
Will average pooling also result in local translation invariance? Briefly tell us or show us why you think so.

# Batch Normalization
If the input activations are too large or are scaled differently (for example, if one takes values in (100, 200) and another in (1, 2)), optimization can become very difficult. So, what if we just forced all the activations to behave "nicely"? We can do that using Batch Normalization.
I am not going to write the full mathematical details for Batch Norm. I highly recommend checking out the provided resources to learn more. This will help you understand when it can be helpful.
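
As a rough sketch of the training-time computation only (see the resources for the full story), batch norm standardizes each feature over the batch and then applies a learnable scale and offset. The names `gamma`, `beta`, and `eps` are assumptions for this illustration, and the data reuses the two scales mentioned above. At inference time, running averages of the mean and variance are used instead of the batch statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 64 samples with two features on very different scales.
x = np.stack(
    [rng.uniform(100, 200, size=64), rng.uniform(1, 2, size=64)], axis=1
)

eps = 1e-3           # small constant for numerical stability
gamma = np.ones(2)   # learnable scale, initialized to 1
beta = np.zeros(2)   # learnable offset, initialized to 0

mean = x.mean(axis=0)                    # per-feature mean over the batch
var = x.var(axis=0)                      # per-feature variance over the batch
x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature
y = gamma * x_hat + beta                 # scale and shift

print(y.mean(axis=0), y.std(axis=0))     # roughly 0 and roughly 1 for both features
```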

Batch norm layers (like any other normalization layers) are tricky, and there is not always agreement on the most useful way to apply them. They also don't necessarily work well with some activations or architectures. Although you might find some literature on whether you should use BatchNorm for your use case, trial and error is often your best friend. The order of ops I normally follow is:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
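
This ordering can be sketched with built-in Keras layers. The filter count, kernel size, dropout rate, and input shape are arbitrary choices for illustration. `use_bias=False` is an optional tweak that is not part of the ordering itself: BatchNorm's own learnable offset makes the convolution bias redundant.

```python
import tensorflow as tf

block = tf.keras.Sequential(
    [
        tf.keras.layers.Conv2D(8, 3, padding="same", use_bias=False),  # CONV
        tf.keras.layers.BatchNormalization(),                          # BatchNorm
        tf.keras.layers.ReLU(),                                        # activation
        tf.keras.layers.Dropout(0.1),                                  # Dropout
    ]
)

# training=True makes BatchNorm use batch statistics and turns Dropout on.
out = block(tf.random.normal((2, 16, 16, 3)), training=True)
out_shape = tuple(out.shape)
print(out_shape)  # (2, 16, 16, 8)
```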

In [None]:
# in tensorflow, we can use batchnorm as follows. Please look at the documentation
# for usage instructions.
tf.keras.layers.BatchNormalization()
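
One usage detail worth seeing in action is the `training` argument. In this minimal sketch, with an arbitrary input range, `training=True` normalizes with the current batch statistics, while `training=False` uses the layer's running (moving) statistics, which have barely been updated after a single batch.

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.uniform((64, 2), minval=100.0, maxval=200.0)

y_train = bn(x, training=True)   # uses this batch's mean and variance
y_infer = bn(x, training=False)  # uses the moving mean and variance

train_mean = float(tf.reduce_mean(y_train))  # close to 0
infer_mean = float(tf.reduce_mean(y_infer))  # far from 0: moving stats lag behind

# gamma and beta are trainable; the moving statistics are not.
n_trainable = len(bn.trainable_weights)
n_non_trainable = len(bn.non_trainable_weights)
print(train_mean, infer_mean, n_trainable, n_non_trainable)
```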

# Custom Layers

Most of the time when writing code for machine learning models you want to operate at a higher level of abstraction than individual operations and manipulation of individual variables.

Many machine learning models are expressible as the composition and stacking of relatively simple layers, and TensorFlow provides both a set of many common layers as well as easy ways for you to write your own application-specific layers either from scratch or as the composition of existing layers.

The `tf.keras.layers.Layer` class is one of the fundamental abstractions in TensorFlow. A layer encapsulates a state (weights) and some computation (defined in the `call` method). A `tf.keras.layers.Layer` has three important things in it:
1. `__init__`, where you can do all input-independent initialization
2. `build`, where you know the shapes of the input tensors and can do the rest of the initialization
3. `call`, where you do the forward pass computation
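
The order in which these three methods run can be observed with a small tracing layer. This is only a sketch: the `Traced` class and the module-level `events` list are made up for this illustration. Note that `build` runs once, on the first call, while `call` runs on every call.

```python
import tensorflow as tf

events = []  # hypothetical log, used only for this illustration


class Traced(tf.keras.layers.Layer):
    """Identity layer that records when each method runs."""

    def __init__(self):
        super().__init__()
        events.append("__init__")

    def build(self, input_shape):
        # Runs once, when the input shape is first known.
        events.append("build")

    def call(self, inputs):
        # Runs on every forward pass.
        events.append("call")
        return inputs


layer = Traced()
layer(tf.ones((2, 3)))
layer(tf.ones((2, 3)))
print(events)  # ['__init__', 'build', 'call', 'call']
```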




The best way to implement your own layer is to extend the `tf.keras.layers.Layer` class and implement the three methods mentioned above.

Note that you don't have to wait until `build` is called to create your variables; you can also create them in `__init__`. However, the advantage of creating them in `build` is that it enables late variable creation based on the shape of the inputs the layer will operate on. Creating variables in `__init__`, on the other hand, means that the shapes required to create the variables must be specified explicitly.

A simple layer (without the build method) would look like:

In [None]:
class LinearWithoutBuild(tf.keras.layers.Layer):
    """y = w.x + b"""

    def __init__(self, units=32, input_dim=32):
        super().__init__()
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(
            initial_value=w_init(shape=(input_dim, units), dtype="float32"),
            trainable=True,
        )
        b_init = tf.zeros_initializer()
        self.b = tf.Variable(
            initial_value=b_init(shape=(units,), dtype="float32"), trainable=True
        )

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b


FYI, `units` is the dimensionality of the output space.

However, the better way is to create the weights in `build`:

In [None]:
class Linear(tf.keras.layers.Layer):
    """y = w.x + b"""

    def __init__(self, units=32):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="random_normal",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,), initializer="random_normal", trainable=True
        )

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

This way we don't have to specify the input dimensions; only the output dimensions are needed. This is very useful when we are building bigger models with our layers in them. We do not have to calculate input and weight dimensions for every layer, because the output dimensions suffice.
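
The built-in layers follow the same deferred-build pattern. As a small sketch, here it is with `tf.keras.layers.Dense` (used so that the snippet is self-contained; the sizes are arbitrary). The layer has no weights until it sees its first input, and the kernel shape is then inferred from that input.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(4)
weights_before = len(layer.weights)  # no input seen yet, so no weights exist

layer(tf.ones((2, 5)))               # the first call triggers `build`

kernel_shape = tuple(layer.kernel.shape)  # inferred from the input: (5, 4)
bias_shape = tuple(layer.bias.shape)      # depends only on units: (4,)
print(weights_before, kernel_shape, bias_shape)
```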

You can use this layer like any other layer:

In [None]:
# Instantiate our layer.
linear_layer = Linear(4)

# This will also call `build(input_shape)` and create the weights.
y = linear_layer(tf.ones((2, 2))) # first 2 is batch size
y

And if you change the input dims from 2 to 5:

In [None]:
# Instantiate our layer.
linear_layer = Linear(4)

# This will also call `build(input_shape)` and create the weights.
y = linear_layer(tf.ones((2, 5))) # first 2 is batch size
y

Please note that this layer is only an example. Your shapes, initializations and the operations will depend on what precisely you want to implement.

## ⏰ Exercise
Copy the layer definition without the `build` method here. Initialize the layer to have an output size of 4. Try to run your layer with different input dimensions like we did above without getting errors. What was different this time?

## Trainable and non-trainable weights

Weights created by layers can be either trainable or non-trainable. They're
exposed in `trainable_weights` and `non_trainable_weights` respectively.
Here's a layer with a non-trainable weight:

In [None]:

class ComputeSum(tf.keras.layers.Layer):
    """Returns the sum of the inputs."""

    def __init__(self, input_dim):
        super().__init__()
        # Create a non-trainable weight.
        self.total = tf.Variable(initial_value=tf.zeros((input_dim,)), trainable=False)

    def call(self, inputs):
        self.total.assign_add(tf.reduce_sum(inputs, axis=0))
        return self.total


my_sum = ComputeSum(2)
x = tf.ones((2, 2))

y = my_sum(x)
print(y.numpy())  # [2. 2.]

y = my_sum(x)
print(y.numpy())  # [4. 4.]

assert my_sum.weights == [my_sum.total]
assert my_sum.non_trainable_weights == [my_sum.total]
assert my_sum.trainable_weights == []

## Layers that own layers

Layers can be recursively nested to create bigger computation blocks.
Each layer will track the weights of its sublayers
(both trainable and non-trainable).

In [None]:
# Let's reuse the Linear class
# with a `build` method that we defined above.


class MLP(tf.keras.layers.Layer):
    """Simple stack of Linear layers."""

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear(32)
        self.linear_2 = Linear(32)
        self.linear_3 = Linear(10)

    def call(self, inputs):
        x = self.linear_1(inputs)
        x = tf.nn.relu(x)
        x = self.linear_2(x)
        x = tf.nn.relu(x)
        return self.linear_3(x)


mlp = MLP()

# The first call to the `mlp` object will create the weights.
y = mlp(tf.ones(shape=(3, 64)))

# Weights are recursively tracked.
assert len(mlp.weights) == 6

Note that our manually created MLP above is architecturally equivalent to the
following built-in option (only the default weight initializers differ):

In [None]:
mlp = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(32, activation=tf.nn.relu),
        tf.keras.layers.Dense(32, activation=tf.nn.relu),
        tf.keras.layers.Dense(10),
    ]
)
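
As a quick sanity check of that equivalence, the sketch below builds the `Sequential` version on the same `(3, 64)` input used for the hand-written `MLP` and counts its weights. It uses the string `"relu"` instead of `tf.nn.relu`, which selects the same activation.

```python
import tensorflow as tf

mlp = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10),
    ]
)

# As with our custom layers, the first call creates the weights.
y = mlp(tf.ones((3, 64)))

n_weights = len(mlp.weights)  # one kernel and one bias per Dense layer
out_shape = tuple(y.shape)
print(n_weights, out_shape)   # 6 (3, 10)
```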

## Optional: Tracking losses created by layers

Layers can create losses during the forward pass via the `add_loss()` method.
This is especially useful for regularization losses.
The losses created by sublayers are recursively tracked by the parent layers.

Here's a layer that creates an activity regularization loss:

In [None]:

class ActivityRegularization(tf.keras.layers.Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super().__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(inputs))
        return inputs


Any model incorporating this layer will track this regularization loss:

In [None]:
# Let's use the loss layer in a MLP block.
class SparseMLP(tf.keras.layers.Layer):
    """Stack of Linear layers with a sparsity regularization loss."""

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear(32)
        self.regularization = ActivityRegularization(1e-2)
        self.linear_3 = Linear(10)

    def call(self, inputs):
        x = self.linear_1(inputs)
        x = tf.nn.relu(x)
        x = self.regularization(x)
        return self.linear_3(x)


mlp = SparseMLP()
y = mlp(tf.ones((10, 10)))

print(mlp.losses)  # List containing one float32 scalar

These losses are cleared by the top-level layer at the start of each forward
pass -- they don't accumulate. `layer.losses` always contains only the losses
created during the last forward pass. You would typically use these losses by
summing them before computing your gradients when writing a training loop.

In [None]:
# Losses correspond to the *last* forward pass.
mlp = SparseMLP()
mlp(tf.ones((10, 10)))
assert len(mlp.losses) == 1
mlp(tf.ones((10, 10)))
assert len(mlp.losses) == 1  # No accumulation.

# Let's demonstrate how to use these losses in a training loop.

# Prepare a dataset.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train.reshape(60000, 784).astype("float32") / 255, y_train)
)
dataset = dataset.shuffle(buffer_size=1024).batch(64)

# A new MLP.
mlp = SparseMLP()

# Loss and optimizer.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:

        # Forward pass.
        logits = mlp(x)

        # External loss value for this batch.
        loss = loss_fn(y, logits)

        # Add the losses created during the forward pass.
        loss += sum(mlp.losses)

    # Get gradients of the loss wrt the weights. This is done outside the
    # tape context so the gradient computation itself is not recorded.
    gradients = tape.gradient(loss, mlp.trainable_weights)

    # Update the weights of the MLP.
    optimizer.apply_gradients(zip(gradients, mlp.trainable_weights))

    # Logging.
    if step % 100 == 0:
        print("Step:", step, "Loss:", float(loss))

## ⏰ Exercise
Make and use a ResNet block using custom layer creation.
A ResNet block looks like this:

![resnet block](https://d2l.ai/_images/resnet-block.svg)

What is ResNet?

[Paper](https://arxiv.org/pdf/1512.03385.pdf)

[Resources](https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8)


You can choose to do either of the two versions.

Your task is to create a custom layer called `resnetBlock` that implements one of the above blocks (and only one block) using the predefined layers available to you in TensorFlow (Dense, Conv2D, etc.). Bear in mind that you are supposed to implement a **custom layer using subclassing**, as we have discussed above. Do not use the Sequential or the functional API. The number of channels in the output should be a user-defined quantity that you take in as an argument when instantiating the class.

Once you define `resnetBlock`, create one instance of it with any number of output channels that you wish. Sublayers such as `Conv2D` create their weights on the first call, so a single forward pass is enough to build the block. Show that your layer works using an image of your choice or a random tensor with the correct dimensions: provide it as input to the layer instance and verify the output shape. For an input of shape $(B,H,W,C_{in})$ the expected output is of shape $(B,H,W,C_{out})$ (in this case, the batch size $B$ will be 1). There is no training or fitting necessary. We are only going to verify that the forward pass outputs the correct shapes.
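
One possible way to do the shape check is sketched below. The helper `check_output_shape` and the image size are assumptions made for illustration, and the `Conv2D` layer is only a stand-in so that the snippet runs on its own. It is not a ResNet block, so swap in your own `resnetBlock` instance.

```python
import tensorflow as tf


def check_output_shape(layer, c_in, c_out, height=32, width=32):
    """Run one forward pass on a random image and compare the output shape."""
    x = tf.random.normal((1, height, width, c_in))  # batch size B = 1
    y = layer(x)
    return tuple(y.shape) == (1, height, width, c_out)


# Stand-in layer, NOT a ResNet block: replace with your `resnetBlock(16)`.
stand_in = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same")
ok = check_output_shape(stand_in, c_in=3, c_out=16)
print(ok)  # True
```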
