I think your definition of "NP-Hard" is incorrect. It would be better to say "exponentially bounded in required training data size" instead.

I say this because a solid model of these terms is required when making "intuitive conjectures" about what is and isn't feasible. Driving is NP-Hard? Taken literally, that would hint that human brains can efficiently compute things a Turing machine cannot, which, suffice it to say, is probably not the case. I'm not aware of any algorithm that humans can carry out and computers cannot.

In images free of distortion and noise, computers were outperforming humans in image recognition 2 years ago [0]. Since then, with advances in GANs (which specifically address the noise/distortion issue), I suspect we are close to achieving super-human image recognition. The last missing piece is "context" or prior knowledge. If you only see a wisp of hair sticking out from behind an object, you have prior knowledge that hair grows on humans, and that there's probably a person there. This last piece is being addressed by multi-modal networks [1].

If you still doubt the power of computer vision, my last hope is to link you to this paper from 2017. Look at the "regions-hierarchical" results:

https://cs.stanford.edu/people/ranjaykrishna/im2p/index.html

[0] https://arxiv.org/pdf/1706.06969.pdf

[1] https://www.themtank.org/multi-modal-methods