"Gaussian processes (GPs) are known to provide accurate predictions and uncertainty estimates in small-data settings by capturing similarity between data points through their kernel function. However, traditional GP kernels do not perform well on high-dimensional data. One solution is to use a neural network to map the data to low-dimensional embeddings before computing the kernel; however, the large data requirements of neural networks make this approach ineffective in small-data settings. We resolve the conflicting demands of representation learning and data efficiency by mapping high-dimensional data to low-dimensional probability distributions with a probabilistic neural network and then computing kernels between these distributions to capture similarity. We also derive a functional gradient descent procedure for end-to-end training of our model. Experiments on various datasets show that our approach outperforms the state of the art in GP kernel learning."
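One concrete way to realize a kernel between low-dimensional probability distributions is the closed-form expectation of an RBF kernel under diagonal Gaussian embeddings. The sketch below is illustrative only, assuming Gaussian embeddings and an RBF base kernel; it is not necessarily the paper's exact construction, and the function name and `lengthscale` parameter are hypothetical.

```python
import numpy as np

def expected_rbf_kernel(mu1, var1, mu2, var2, lengthscale=1.0):
    """Expected RBF kernel between two diagonal Gaussian embeddings
    N(mu1, diag(var1)) and N(mu2, diag(var2)).

    Uses the closed form: since x - y ~ N(mu1 - mu2, var1 + var2)
    per dimension, E[exp(-(x - y)^2 / (2 l^2))] =
    sqrt(l^2 / (l^2 + var1 + var2)) * exp(-(mu1 - mu2)^2 / (2 (l^2 + var1 + var2))),
    taken as a product over dimensions.
    """
    denom = lengthscale ** 2 + var1 + var2          # per-dimension variance of x - y plus l^2
    scale = np.prod(np.sqrt(lengthscale ** 2 / denom))
    expo = np.exp(-0.5 * np.sum((mu1 - mu2) ** 2 / denom))
    return scale * expo

# With zero embedding variance this reduces to the standard RBF kernel;
# increasing the predicted variance shrinks the similarity, so uncertain
# embeddings contribute less confident kernel values.
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
k_point = expected_rbf_kernel(mu1, np.zeros(2), mu2, np.zeros(2))   # standard RBF value
k_uncertain = expected_rbf_kernel(mu1, 0.5 * np.ones(2), mu2, 0.5 * np.ones(2))
```

In an end-to-end model, `mu` and `var` would be the outputs of the probabilistic neural network, and this kernel would feed directly into the GP's covariance matrix.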