When applying machine learning to fine art paintings, one obvious approach is to analyse the visual content of the paintings. We discuss two major problems that led us to take a semantic route instead: (i) state-of-the-art image analysis has been trained on photos and does not work well on paintings; (ii) the visual information obtainable from paintings is not sufficient for building a curatorial narrative.
Let us start by using the DenseCap online tool to automatically compute captions for a photo of two dogs playing.
The DenseCap model correctly identifies the dogs and many of their properties (e.g. “the dog is brown”, “the dog has brown eyes”, “the front legs of a dog”, “the ear|head|paw of a dog”) as well as aspects of the background (“a piece of grass”, “a leaf on the ground”). There are some wrong captions for the dogs (“the dogs tail is white”, but there is no tail in the picture) and for the background as well (“the fence is white”). But all in all the computer vision system does a good job at what it has been trained to do: localize and describe salient regions in images in natural language [Johnson et al 2016].
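The correct/wrong tally above is done by eye. As a hypothetical illustration of how one might make such a tally quantitative, the toy function below scores a set of generated captions against expected scene keywords. The function name, the caption list, and the keyword set are our own illustrative assumptions, not part of DenseCap:

```python
def keyword_hit_rate(captions, keywords):
    """Fraction of captions that mention at least one expected keyword."""
    hits = sum(any(k in c for k in keywords) for c in captions)
    return hits / len(captions)

# Illustrative captions modelled on the DenseCap output for the dog photo.
captions = [
    "the dog is brown",
    "the front legs of a dog",
    "a piece of grass",
    "the fence is white",   # wrong: there is no fence in the photo
]

# Keywords a human annotator might expect for this scene.
expected = {"dog", "grass", "leaf"}

print(keyword_hit_rate(captions, expected))  # 3 of 4 captions match -> 0.75
```

Such a keyword match is of course far cruder than human judgement (it would count “the dogs tail is white” as correct), which is precisely why the qualitative assessment above matters.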
Let us now apply this system to a fairly realistic dog painting from the collection of our partner museum Belvedere, Vienna.
Again many characteristics of the dog are correctly identified (“the dog is looking at the camera”, “the eye|ear of the dog”), as is “a bowl of food”. The background proves more problematic: some confusions are still comprehensible (“a white napkin”, “the curtain is white”), but others less so (“the flowers are on the wall”).
Testing the system on a more abstract painting, Belvedere’s collection highlight “The Kiss” by Gustav Klimt, yields even stranger results.
While some captions are correct (“the mans hand”, “the dress is yellow”, “the flowers are yellow|green”), others are somewhat off (“the hat is on the mans head”) or just completely wrong (“the picture is white”, “the wall is made of bricks”, “a black door”, “a window on the building”). The essential aspect of the painting, a man and a woman embracing, is not comprehended at all. Of course this is understandable, since the DenseCap system has been trained on 94,000 photo images (plus 4,100,000 localized captions) but not on fine art paintings, which explains why it cannot generalize to more abstract forms of art.
On the other hand, even if an image analysis system could perfectly detect that “Caesar am Rubicon” shows a dog looking at a sausage in a bowl on a table, it would still not grasp the meaning of the painting: Caesar is both the name of the dog and the historical figure who crossed the Rubicon, an act that amounted to a declaration of war on the Roman Senate and ultimately led to Caesar’s rise to Roman dictator. Hence “crossing the Rubicon” is now a metaphor for passing a point of no return.
The same holds for Gustav Klimt’s “The Kiss”. Even if the image analysis system were not fooled by Klimt’s mosaic-like, two-dimensional Art Nouveau organic forms and could detect two lovers embracing in a kiss, it would still not grasp the significance of the decadence conveyed by the opulent, exalted golden robes, or the possible allusion to the tale of Orpheus and Eurydice.
The DAD project is about exploring the influence of Artificial Intelligence on the art of curating. From a curatorial perspective, grasping the semantic meaning of works of art is essential for building curatorial narratives that are not based on a purely aesthetic procedure. See our previous blog posts on such a semantics-driven approach to the collection of the Belvedere, where we chose to analyse text about the paintings rather than the paintings themselves.