Tag Archive: neural networks


By passing millions of ImageNet images through InceptionV1 (a state-of-the-art deep convolutional neural network), we can extract the image patches that most strongly activate specific neurons in its various convolutional layers.

By projecting the image patches to 2D using UMAP we can see what the neural network “sees” at the various layers.

This is a great way of explaining how a computer vision model makes its classification decisions.
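For intuition, here is a rough sketch of that recipe, not the article's exact pipeline: it assumes a recent PyTorch/torchvision (using torchvision's GoogLeNet, i.e. InceptionV1) and the umap-learn package, with a random placeholder batch standing in for real ImageNet images. It also projects pooled layer activations per image rather than individual neuron patches, which is a simplification.

```python
import torch
import torchvision.models as models
import umap  # from the umap-learn package

# Load InceptionV1 (GoogLeNet) with pretrained ImageNet weights.
model = models.googlenet(weights="DEFAULT").eval()

# Capture the feature maps of one intermediate convolutional block via a hook.
activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.inception4a.register_forward_hook(save_activation("inception4a"))

# Placeholder batch; in practice these would be preprocessed ImageNet images.
images = torch.randn(64, 3, 224, 224)
with torch.no_grad():
    model(images)

# One activation vector per image (global average pooling of the feature maps),
# then a 2D UMAP projection to see how this layer organizes the inputs.
feats = activations["inception4a"].mean(dim=(2, 3)).numpy()
embedding = umap.UMAP(n_components=2).fit_transform(feats)
print(embedding.shape)  # (64, 2)
```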

However, the following part of the article was the reason for my post:

“….There is another phenomenon worth noting: not only are concepts being refined as you move from layer to layer, but new concepts seem to be appearing out of combinations of old ones….”

This is how a world of complexity works.

We know that deep neural networks perform hierarchical feature learning and combine simpler features to learn more complex ones. This is one of the reasons why we use deep learning for audio, visual and textual data.

Deep learning can decompose the complexity of data!

Have you ever wondered why we randomly initialize the weights of a neural network?

After you read this post you will know why!

When the weights of the neurons in a layer are all initialized to the same value, every neuron in that layer produces the same output during forward propagation.

Furthermore, during backpropagation the gradients of the loss w.r.t. the weights of that layer are also identical. So, when training with gradient descent, the weights all change in the same way.

Lastly, when gradient descent converges, the weight matrix of the layer contains the same values for all neurons, and thus all the neurons have learned the same thing.

To break this symmetry and allow the neurons of the layer (and, in general, of all layers) to learn different features, we randomly initialize the weights!
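Here is a minimal NumPy sketch of the symmetry problem (a tiny hand-written 3-4-1 tanh network; the data and shapes are made up for illustration): with a constant initialization every hidden neuron receives exactly the same gradient, while a random initialization breaks the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 samples, 3 input features
y = rng.normal(size=(8, 1))           # regression targets

def gradients(W1, W2):
    """One forward/backward pass of a 3-4-1 MLP with tanh hidden layer and MSE loss."""
    h = np.tanh(X @ W1)               # hidden activations, shape (8, 4)
    pred = h @ W2                     # outputs, shape (8, 1)
    d_pred = 2 * (pred - y) / len(X)  # dMSE/dpred
    dW2 = h.T @ d_pred                # gradient w.r.t. W2
    dW1 = X.T @ ((d_pred @ W2.T) * (1 - h ** 2))  # gradient w.r.t. W1
    return dW1, dW2

# Constant initialization: every hidden neuron computes the same output and
# gets the same gradient, so the neurons stay identical after every update.
W1, W2 = np.full((3, 4), 0.5), np.full((4, 1), 0.5)
dW1, _ = gradients(W1, W2)
print("constant init, identical gradients per neuron:", np.allclose(dW1, dW1[:, [0]]))

# Random initialization breaks the symmetry: each neuron gets its own gradient.
W1, W2 = rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 1))
dW1, _ = gradients(W1, W2)
print("random init, identical gradients per neuron:  ", np.allclose(dW1, dW1[:, [0]]))
```

The first check prints True and the second prints False: with identical starting weights the four hidden neurons are interchangeable and stay that way under gradient descent.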

In Machine Learning, we use minimum-disturbance learning algorithms (e.g., LMS, backpropagation) to tune the trainable parameters of a topological architecture. Instead of programming a model, we expose it to observations and let the algorithm find its optimal parameters. So, in this approach, the solution to a task is the set of trainable parameters.
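As a toy illustration of such parameter tuning, here is a minimal LMS sketch (the data and names are made up, and no specific library is assumed): the weights are nudged slightly after every observation until they match the underlying signal.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])   # the unknown system we want to imitate
w = np.zeros(3)                       # trainable parameters, found by learning
mu = 0.05                             # learning rate (step size)

for _ in range(2000):
    x = rng.normal(size=3)            # one incoming observation
    d = true_w @ x                    # desired response for this observation
    e = d - w @ x                     # error of the current model
    w += mu * e * x                   # LMS update: a small, minimum-disturbance correction

print(np.round(w, 3))                 # converges close to [2.0, -1.0, 0.5]
```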

In biology, precocial species are those whose young already possess certain abilities from the moment of birth. There is evidence to show that lizard and snake hatchlings already possess behaviors to escape from predators. Shortly after hatching, ducks are able to swim and eat on their own, and turkeys can recognize predators. These species already have abilities before they are even exposed to the world.

NeuroEvolution of Augmenting Topologies is a research field that aims to simulate exactly this behavior. Specifically, neural network architectures can be found that perform various machine learning tasks without any weight training. So, in this approach, the solution to a task is the topological architecture itself.

For more information on this idea, check the paper “Weight Agnostic Neural Networks” by Adam Gaier and David Ha.

Linear and logistic regression algorithms can be used to learn surfaces that fit non-linear data (for regression or classification respectively).

This can be done by adding extra polynomial terms (quadratic, cubic, etc.) to the dataset, built from the existing features. The approximated surfaces can be circles, ellipses, parabolas and hyperbolas.

However, (1) it is difficult to approximate very complex non-linear surfaces this way. Also, when there are many features, (2) it becomes very computationally expensive to compute all the polynomial terms (e.g., a 100×100 grayscale image has 10,000 pixel features, which already yields roughly 10,000²/2 ≈ 50 million quadratic terms).

Usually, due to limitations (1) and (2), we either perform dimensionality reduction to reduce the number of features (e.g., eigenfaces for face recognition) or use more complex non-linear approximators such as Support Vector Machines and Neural Networks.
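Here is a minimal scikit-learn sketch of the polynomial-features trick on a toy dataset (make_circles, so the true boundary is a circle): plain logistic regression cannot separate the two rings, while adding degree-2 terms lets the same linear model learn a circular boundary.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings of points: no straight line can separate them.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# Plain logistic regression: stuck near chance-level accuracy.
linear = LogisticRegression().fit(X, y)
print("linear features:   ", linear.score(X, y))

# Adding x1^2, x1*x2, x2^2 lets the linear model fit a circular boundary.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
quadratic.fit(X, y)
print("quadratic features:", quadratic.score(X, y))
```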

Some personal notes for all AI practitioners!

In Linear Regression, the MSE loss is always a bowl-shaped convex function, so gradient descent can always find the global minimum.

In Logistic Regression, if we use MSE the loss is not a convex function, because the hypothesis is non-linear (it uses a sigmoid activation). Thus, it is harder for gradient descent to find the global minimum. However, if we use the cross-entropy loss, the loss is convex and gradient descent can easily converge to the global minimum!
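A quick numerical sketch of this claim, using a toy one-parameter logistic model (the data and weight grid are made up for illustration): on a grid of weight values the cross-entropy loss passes a second-difference convexity check while the squared-error loss does not.

```python
import numpy as np

# Toy 1-D logistic model p = sigmoid(w * x), no bias, so each loss is a
# function of a single weight w and convexity can be checked on a grid.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.linspace(-6, 6, 1201)              # grid of candidate weights
p = sigmoid(np.outer(w, x))               # predictions, shape (len(w), len(x))

mse = ((p - y) ** 2).mean(axis=1)
cross_entropy = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean(axis=1)

def looks_convex(f):
    """A convex function has non-negative second differences on a grid."""
    return bool(np.all(np.diff(f, 2) >= -1e-9))

print("MSE loss convex?          ", looks_convex(mse))            # False
print("Cross-entropy loss convex?", looks_convex(cross_entropy))  # True
```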

Support Vector Machines also have a convex loss function.

We should always use a convex loss function so that gradient descent can converge to the global minimum (free of local optima).

Neural Networks are very complex non-linear mathematical functions and their loss function is most often non-convex, so it is easy to get stuck in a local minimum. However, most optimization problems in Neural Networks are due to long plateaus and saddle points rather than local minima. For such problems, advanced gradient descent optimization variants were invented (e.g., Momentum, RMSprop, Adam).
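As an example of one such variant, here is a plain NumPy sketch of the Adam update rule with typical hyperparameter values, applied to a toy quadratic loss (the loss function and names are made up for illustration):

```python
import numpy as np

def adam_minimize(grad, w, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w)                  # running mean of gradients (momentum)
    v = np.zeros_like(w)                  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)         # bias correction for the warm-up phase
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_opt = adam_minimize(lambda w: 2 * (w - 3.0), np.array([0.0]))
print(w_opt)                              # converges close to the minimum at w = 3
```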

Happy optimizations!