Shh, will you. Some truths are not to be aired in public.
We know that no manager got fired for choosing Java.
There is a researcher's version of that: no researcher got fired for making a neural network more 'convoluted'. It helps if there exists one dataset where it does 0.3% better. Doesn't matter if that dataset has been, since the late 90s, standard fare as a homework problem in a machine learning course.
That said, we do understand these things a bit better than before. Some concrete math is indeed coming out.
More layers allowed us to explore exponentially more network architectures. And if you look at a lot of the advances in deep learning, particularly in convnets, the architecture is actually key - as important as, or more important than, the weights themselves. I guess another thing is that more layers give a disproportionate increase in performance. Some of it is hardware, but there have definitely been advances in the theory; people aren't getting these new results from 10- or 20-year-old networks that have simply been made larger.
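If it helps, here's a toy back-of-envelope sketch of the "exponentially more architectures" point. The 4 building-block choices per layer are just an illustrative assumption, not anyone's actual search space:

    # Toy count: if each layer in a purely sequential net can be one of
    # k building blocks (say 3x3 conv, 5x5 conv, pooling, ...), then a
    # net of depth d already admits k**d distinct architectures.
    def sequential_architecture_count(k: int, depth: int) -> int:
        return k ** depth

    for depth in (2, 5, 10, 20):
        print(f"depth {depth:>2}: {sequential_architecture_count(4, depth):,} architectures")
    # depth  2: 16 architectures
    # depth  5: 1,024 architectures
    # depth 10: 1,048,576 architectures
    # depth 20: 1,099,511,627,776 architectures

And that's before you even consider skip connections or branching, which blow the count up further.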