What you interact with as the user is the model and its weights.
The model (presumably some kind of convolutional neural network) has many layers, every layer has some set of nodes, and every node has a weight, which is just some coefficient. The weights are 'learned' during the model training where the model takes in the data you mention and evaluates the output. This typically happens on a super beefy computer and can take a long time for a model like this. As images are evaluated the output gets better the weights get adjusted accordingly.
Now we as the user just need the model and the weights!
The model (presumably some kind of convolutional neural network) has many layers, every layer has some set of nodes, and every node has a weight, which is just some coefficient. The weights are 'learned' during the model training where the model takes in the data you mention and evaluates the output. This typically happens on a super beefy computer and can take a long time for a model like this. As images are evaluated the output gets better the weights get adjusted accordingly.
Now we as the user just need the model and the weights!