Language/visual elements in mass media advertising
M. Dimitrova-Vulchanova, L. Martinez, R. Eshuis
(Norway, Norwegian University of Science & Technology, Trondheim)
Over the past five years or so, eye-tracking studies have grown increasingly popular in many applied and commercial fields, such as studying viewers’ responses and attention to ads or to user interfaces on the internet, as well as in research on human cognition. While the first generations of eye-trackers were extremely unwieldy and difficult to use, modern versions are user-friendly, require no special installation, allow for relatively fast and reliable calibration, and can be used with a variety of age groups. The wide application of eye-tracking in both commercial and research fields rests on one simple idea, namely that the direction of a person’s gaze can tell a lot about that person’s attention (if she is working on a task), understanding, and planning in terms of goals (for future actions). Furthermore, recording a gaze pattern does not require any special instructions that might bias the completion of the task: one simply needs to ask the subjects to sit in front of the screen and watch the visual input (a video, picture or written material). Exploiting this idea, in our eye-tracking study we set out to investigate how viewers processed and interpreted an ad video by recording their gaze patterns and by asking them to fill out an elicitation questionnaire.
Forty subjects, mainly students from NTNU in Trondheim as well as some pupils from local high schools, volunteered for the study out of interest in eye-tracking research or in return for course credit. Subjects were first shown the video while their gaze was recorded (after an initial calibration procedure), and were then given a questionnaire to fill in as completely as possible.
The video was presented to subjects individually on a Tobii 1750 eye-tracker (Tobii Technology; a corneal reflection eye-tracker with an integrated 17" TFT monitor) in a noise-isolated room. The screen resolution used was 800x600 pixels.
The video our subjects saw is short and belongs to the category of the sophisticated ad (Hall & Whannel, 2021: 89) by virtue of its design and purpose. It is a parody of the classic Hollywood horror movie The Exorcist, exploiting the idea of the possessed girl and a scientist trying to save her. The message is subtle in that viewers are expected to get a positive impression of the home university of the student/scientist and to form favourable associations likely to influence their choice of a place to study. The choice of genre, a horror movie, is not accidental: like other ads, this one, too, is designed to shock its audience and attract attention at all costs (Dyer, 2019: 67). As part of the genre package, the video features English audio (even though it targets a local Norwegian audience) with Norwegian subtitles. The sub-genre (or presentation format) selected for this video is the movie trailer.
There are good reasons for choosing this particular format. Firstly, young audiences are well acquainted with trailers: they are used to downloading them from the web and viewing them for information, and they are familiar with the trailer style. Secondly, a trailer presents vital information in a compact, fast form, thereby leaving much of the message implicit, with gaps for the viewer to fill in. Thus, a trailer is intellectually more demanding and requires more focus and attention on the part of the viewer than, e.g., an ad or video that straightforwardly introduces both the product and its advantages; this, however, makes it more rewarding in the long run. As such, the trailer format proves particularly suited to the kind of information presentation targeted in subtle, sophisticated ads. As a matter of fact, our subjects did recognize the trailer format and interpreted many of the formal aspects of the ad (e.g., the use of English audio with Norwegian subtitles) in a manner coherent with the trailer format. Moreover, 72.5% reported having seen the original movie. Across the answers in the whole questionnaire the subjects recognized the video as a parody of a horror movie (17.5%) or a movie trailer (15%) and described it as amusing or comical. However, some misinterpretations did occur as well, along the lines of this being an ad targeting an international audience or reflecting the university’s international profile.
Like ads at large, this one is also heavily laden with stereotypes. Here are the most prominent stereotypes found in the current ad:
-the Hollywood movie stereotype (a full package, including audio and subtitling);
-the stereotype of the mad scientist (nerd);
-prominence of male characters: only the male characters act/perform the important tasks, the female character (the girl) is the “victim” (possessed by the Devil and needs to be saved);
-the modern society technological stereotype: machines/equipment prominently present;
-technologized interaction (Hutchby, 2019: 89).
In Panofsky’s influential analysis (Panofsky, 2018: 76) there are three levels at which a visual image of the type present in ads is interpreted: the denotative level, with a focus on what is directly perceived (objects, colours etc.); the connotative level, with the associations that these objects activate; and the level of interpretation, the ideological one. The above stereotypes obviously play a role primarily at the last of these, the ideological/interpretational level.
Subjects were seated in a comfortable chair in front of the eye-tracker screen at a distance of approximately 50 to 60 cm. The position of the chair and the eye-tracker was adjusted to obtain a good tracking status. Before the video was shown, a five-point calibration procedure was performed. In a few cases the calibration was repeated to obtain successful calibration for each point. Once calibration had succeeded, the video was shown. Subjects were told that they simply had to watch the video and that a questionnaire was to be filled in afterwards. The gaze of both eyes was recorded for the duration of the video.
The analyses are based on the average gaze of both eyes. A fixation is defined as a sequence of consecutive gaze samples (gaze is measured about 50 times a second) that stay less than 30 pixels away from the mean of the previous samples for at least 100 ms (consecutive samples should thus stay within about 0.7° of visual angle). The data of two subjects were excluded from the analysis: in spite of successful calibration, the numbers of fixations recorded for these subjects were only 6 and 15 respectively. All other subjects had more than 40 fixations recorded.
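The fixation criterion just described can be illustrated with a minimal Python sketch. It is not the eye-tracker software's own filter; it only assumes what the text states (samples roughly every 20 ms, a 30-pixel distance to the running mean, a 100 ms minimum duration) and, as further assumptions of ours, that the input is already the binocular average given as (x, y) pixel pairs and that a sample falling outside the threshold starts a new candidate fixation. All names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot

SAMPLE_INTERVAL_MS = 20   # about 50 gaze samples per second
MAX_DISTANCE_PX = 30      # a sample must stay within 30 px of the running mean
MIN_DURATION_MS = 100     # shorter runs are not counted as fixations


@dataclass
class Fixation:
    x: float           # mean horizontal position in pixels
    y: float           # mean vertical position in pixels
    duration_ms: int   # number of samples times the sample interval


def _mean(points):
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def _close(run, fixations):
    # Keep the run as a fixation only if it lasted long enough.
    duration = len(run) * SAMPLE_INTERVAL_MS
    if duration >= MIN_DURATION_MS:
        fixations.append(Fixation(*_mean(run), duration))


def detect_fixations(samples):
    """Group consecutive (x, y) gaze samples into fixations."""
    fixations = []
    run = []  # samples of the candidate fixation currently being built
    for x, y in samples:
        if run:
            mean_x, mean_y = _mean(run)
            if hypot(x - mean_x, y - mean_y) >= MAX_DISTANCE_PX:
                _close(run, fixations)
                run = []  # assumption: the outlying sample starts a new run
        run.append((x, y))
    if run:
        _close(run, fixations)
    return fixations


# Invented samples: 120 ms at one spot, a 40 ms glance, 100 ms at another spot.
samples = [(100, 100)] * 6 + [(400, 300)] * 2 + [(600, 200)] * 5
fixations = detect_fixations(samples)
```

In this invented example the 40 ms glance falls below the duration threshold and is dropped, so two fixations remain.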
The video was segmented for the analysis in two ways. One division was according to the different scenes, for the analysis of the visual items. The other division was according to the appearance and disappearance of subtitling and the fading in of other text at the end of the video, and served to analyse the processing of text. In each of the sub-divisions Areas of Interest (AOIs) were defined around visual items and around text in order to see which parts of the video subjects fixated.
Part of the analysis is very simple and involves only the percentage of subjects fixating at least once in an AOI. This is done for the visual items in the video and for each of the (sub-)titles as a whole. We feel that any further analysis would be severely hampered by the fact that the scenes in the video differ in length, are dynamic, and involve changes in the visibility of (parts of) visual items. However, we feel that it is possible to compare the individual words within each (sub-)title. All words in one subtitle are presented simultaneously, are not dynamic, and are presented for an equal amount of time. Although the background on which subtitles appear may be dynamic and changing (even from one scene to another), the change is the same for each word in the (sub-)title. Thus, individual words within a (sub-)title are also compared for differences in average fixation length, average number of fixations and (following from these) average total looking time.
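The AOI measures named above can be sketched as follows. This is an illustration under assumptions, not the analysis script used in the study: AOIs are taken to be axis-aligned rectangles, the average number of fixations and the average total looking time are taken over all subjects (including those who never fixated the AOI), and the average fixation length is taken over the fixations inside the AOI. The function name and the example data are invented.

```python
def aoi_metrics(fixations_by_subject, aoi):
    """Summarise fixations inside one rectangular AOI.

    fixations_by_subject: dict mapping subject id -> list of (x, y, duration_ms)
    aoi: (left, top, right, bottom) in pixels
    """
    left, top, right, bottom = aoi
    n_subjects = len(fixations_by_subject)
    durations = []     # durations of every fixation that lands in the AOI
    subjects_hit = 0   # subjects with at least one fixation in the AOI
    for fixations in fixations_by_subject.values():
        inside = [d for x, y, d in fixations
                  if left <= x <= right and top <= y <= bottom]
        if inside:
            subjects_hit += 1
        durations.extend(inside)
    n_fix = len(durations)
    return {
        "percent_fixating": 100 * subjects_hit / n_subjects,
        "mean_fixations": n_fix / n_subjects,
        "mean_fixation_ms": sum(durations) / n_fix if n_fix else 0.0,
        "mean_total_looking_ms": sum(durations) / n_subjects,
    }


# Invented data for four subjects and one word-sized AOI.
data = {
    "s1": [(10, 10, 200), (12, 11, 300)],
    "s2": [(500, 500, 250)],
    "s3": [],
    "s4": [(15, 15, 100)],
}
metrics = aoi_metrics(data, aoi=(0, 0, 50, 50))
```

With these definitions the average total looking time equals the average number of fixations multiplied by the average fixation length, which is one way to read the remark that the total looking time follows from the other two measures.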
Visual content: The parts of the images fixated by most viewers are clearly the animate characters in the scene, i.e. the possessed girl, the student, and the devil, and in particular their faces. Often all viewers fixated on a face or animate character, in particular when it appears in close-up, when it is (almost) the only item in a shot, or when it is the central part of a shot. In shots with more action and/or more than one animate character, the percentage of subjects fixating on (the face of) an animate character may drop somewhat (occasionally to about 65% of the viewers fixating it at least once). All visual elements fixated by half or more of our viewers are (faces of) animate characters, with two exceptions. All subjects fixate at some point on the student’s handling of advanced-looking equipment when he examines the possessed girl. Also, the NTNU T-shirt worn by the student is fixated by half of the viewers in the final shot. One other visual object fixated relatively often is an inanimate doll (which has a face), fixated by almost 40% of our participants. All other visual objects in the scenes were fixated by only around 10% of the subjects or fewer.
Text content: Regarding fixations on the subtitles, the number of subjects fixating at least once on any part of the text is initially high for the first (Hun var besatt ‘She was possessed’) and second (Og den eneste som kunne redde henne… ‘And the only one who could save her…’) subtitles (95% and 89% respectively). Interest in the next two subtitles (…var en student… ‘was a student’ and …fra NTNU i Trondheim. ‘from NTNU in Trondheim’) is lower (53% and 55% fixate on these), increases somewhat for the fifth subtitle (-Oj. Du er besatt av Djevelen. ‘Oh, you are possessed by the Devil’, 79%), decreases again on the next (-NTNU værsågod. ‘NTNU here’, 50%), and increases once more on the seventh and final subtitle (-Kan du gi meg den beste religionshistorikeren vi har? ‘Can you please get me the best religion historian we have?’, 74%).
The central message, faded in midway through the last scene (NTNU søker nye studenter ‘NTNU is recruiting new students’) and staying on screen once the final scene has faded out, is fixated by almost all viewers (97%). The university logo displayed simultaneously in the lower right corner is fixated by only 26% of the viewers, and 50% look at the web address in the lower left corner.
Areas were defined around the individual words as well. In the first subtitle, var (was) was fixated by the largest share of viewers (58%), but besatt (possessed) was fixated by almost as many (55%). The mean duration of the fixations on the individual words follows this pattern (275 ms for var and 269 ms for besatt) and so does the average number of fixations per subject (.84 and .74 respectively). In the second subtitle, eneste (the only), som (who), redde (save), den (the) and henne (her) were fixated by the largest shares of subjects (42%, 37%, 34%, 32% and 32% respectively), whereas og (and) was fixated by no one.
However, the average length of a fixation on redde was 303 ms, while on eneste, som, den, and henne it was between 191 and 204 ms. Combined with the average number of fixations, which ranges between .32 and .42 for these five words, the average total looking time is highest for redde with 112 ms, while it ranges between 63 and 86 ms for the other words. In the third subtitle, ‘student’ was fixated by the largest share of viewers (34%). This word was also fixated longer and more often than the others, and thus its average total looking time is higher as well. In the fourth subtitle, NTNU and Trondheim were fixated by equal shares of subjects (24% each). Average total looking time is higher for NTNU (80 ms) than for ‘Trondheim’ (42 ms), owing to a somewhat higher number of fixations and a longer fixation duration (.32 and 254 ms versus .24 and 177 ms). In the fifth subtitle, mainly besatt (possessed), Du (you) and Djevelen (the Devil) were fixated by the viewers (47%, 42%, and 37%).
The same pattern can be found in the average number of fixations (.71, .50, and .39) and in average total looking time (129 ms, 93 ms and 76 ms), whereas the three words are very close to one another with respect to average fixation duration (182 ms, 187 ms and 191 ms). In the sixth subtitle, 34% of the viewers fixated værsågod (please) and 26% did so for NTNU. The same pattern is reflected in the average number of fixations, the average fixation length and (thus) the average total looking time. In the seventh subtitle, mainly religionshistorikeren (the religion historian) and den are fixated at least once by the viewers (by 53% and 45% of them respectively). The average length of a fixation differed little (209 ms and 194 ms), but the average number of fixations is higher for religionshistorikeren (1.24) than for ‘den’ (.47). Correspondingly, the average total looking time differs as well (259 ms and 92 ms respectively).
In the end title NTNU søker nye studenter, søker (recruits), studenter (students), and nye (new) were fixated by 92%, 87% and 76% of the participants respectively. Average fixation duration follows the same pattern but does not differ that much (265, 241 and 235 ms respectively), whereas the average number of fixations shows larger differences in the same direction (2.39, 1.74 and 1.50), resulting in average total looking times of 635 ms, 419 ms and 353 ms for these three words respectively. We address the significance of the word fixations in the discussion section.
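As a quick arithmetic check on the word-level figures reported above, the short Python sketch below multiplies the reported average number of fixations by the reported average fixation duration and compares the product with the reported average total looking time. The assumption that the total is (approximately) this product is ours; the numbers are copied from the text, and the function name is hypothetical.

```python
# (word, average number of fixations, average fixation duration in ms,
#  reported average total looking time in ms), as given in the text
reported = [
    ("NTNU", 0.32, 254, 80),
    ("Trondheim", 0.24, 177, 42),
    ("religionshistorikeren", 1.24, 209, 259),
    ("den", 0.47, 194, 92),
    ("søker", 2.39, 265, 635),
    ("studenter", 1.74, 241, 419),
    ("nye", 1.50, 235, 353),
]


def looking_time_check(rows):
    """Return (word, product, reported total, difference) for each row."""
    result = []
    for word, n_fixations, duration_ms, total_ms in rows:
        product = n_fixations * duration_ms
        result.append((word, product, total_ms, product - total_ms))
    return result


check = looking_time_check(reported)
largest_gap = max(abs(diff) for _, _, _, diff in check)
```

For these seven words the product stays within 2 ms of the reported total, which is what one would expect if the small gaps come from rounding of the published averages.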
Here we briefly describe two groups of questions from the questionnaire that are essential to our analysis: questions addressing the interplay of visual elements (images) and language in interpreting the message, and questions addressing how and at what points the message was conveyed. There were two questions in the first group, each with two parts, a multiple-choice part and a free-comment part:
A. What attracted your attention? The possible answers were: (a) the language used, (b) the image, (c) both the language and the image, (d) neither of the two.
B. What was more convincing in conveying the video’s message? The possible answers were: (a) the language used, (b) the image, (c) both the language and the image, (d) neither of the two.
Thus, the focus in question A. was on whether language or visual information was more prominent, while in B. the focus was on which of the two conveyed the message more convincingly.
In the answers to question A. (what attracted the informants’ attention), 50% of the informants chose the images, 42.5% both the images and the language, and 5% only the language. This gives a total of 92.5% for whom the visual image was important in attracting attention to the video vs. 47.5% who thought it was the language that attracted attention. Among those who also chose to comment on their responses, the comments can be grouped as follows: 10% of the informants found the images unexpected/ surprising/ strange, and 15% found them disturbing/ unpleasant/ disgusting/ grotesque (these counts include only subjects who mentioned these or related words explicitly).
The following visual elements in the video clip attracted most attention: the possessed girl (mentioned by 22.5%), the composition including things like “special effects”, lighting, camera movement (mentioned by 20%), the student (mentioned by 10%), and the student’s T-shirt with an NTNU print (5%). None of the informants commented on particular features of the girl’s picture, apart from describing it as creepy, scary, grotesque or unpleasant to look at. 10% mentioned textual elements on the screen, referring to text in general (5%) or to the subtitles (5%). All of these images belong in Panofsky’s (Panofsky, 2018: 63) connotational level in activating specific associations to be further used at the level of interpretation.
It is important to emphasise that almost half of the informants recognized the interplay of the visual images with the language. In this respect our results conform to what has been traditionally assumed and claimed for the equal importance of image and language in ads (Dyer, 2019: 86). The question did not specify whether “language” refers exclusively to the audio, or also included the written text, present in the video in a number of elements: the subtitles accompanying the spoken words, the university acronym printed with large letters on the T-shirt of one of the characters, the end text announcing that the university is recruiting students. Some informants commented that the image and the sound in the video make a strange but effective combination. The sound contributed strongly to the impact of the video in several ways: the combination of English audio with Norwegian subtitles, and the exaggerated/ dramatic voice of the narrator clearly formed associations with the horror movie tradition and the trailer format. In addition, there was a comical twist due to the Norwegian accent of the main character and the switching between English and Norwegian, both in the narrator’s and in the student’s speech.
Language plays a prominent role, as revealed in the answers to question B. (What was more convincing in conveying the video’s message?). 35% of our informants chose the language as more convincing, compared to 22.5% who chose the image and 27.5% who thought it was both image and language. 12.5% found neither image nor language convincing in conveying the message. Here language comes first with a total of 62.5%, while the visual elements come second with 50%. Observe that while the focus of question A. was on which element was more prominent in attracting attention in general, question B. targeted how the different elements fulfilled a particular function, thus explicitly placing the language/visual elements in a particular context. This may explain the discrepancy between the weightings of image and language in the two groups of answers. It is interesting to see how the roles of both elements in fulfilling the main (advertising) function of the video are further spelled out: many subjects who judged language, or both language and image, as crucial in conveying the message commented that the image functioned only as an eye-catcher, without playing a significant role in transmitting the message itself.
Furthermore, some subjects felt that the interplay between the image and the language was very important in a somewhat hierarchical way. They saw the function of the image as an initial “attention-grabber”, its unusualness adding to the comic effect, after which the focus shifts to language as the central code for the message. Other informants felt, however, that the influence of the image was so strong that it stole attention from the language and thus also from the message the video wants to convey. 5% of the informants even felt that everything in the video developed too quickly, with too many elements interacting at the same time: image, spoken language, sound, text. For these subjects the image stood out as the strongest element, which occupied all their attention and remained entrenched in their memory.
In summary, for this particular video, there was a mismatch between what attracted the attention of the viewers in general, and what successfully conveyed the message. Comparing the answers to questions A. and B., we can say that the image attracted more attention in general, while it was the language that was prominent in conveying the video’s message.
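The totals quoted for questions A. and B. follow from adding the share of informants who chose “both” to each single-modality share. The small Python sketch below reproduces that arithmetic from the percentages reported in the text; the dictionary layout and function name are ours, and the share choosing “neither” for question A. is not reported and is therefore left out.

```python
# Percentages of informants per answer option, as reported in the text.
question_a = {"language": 5.0, "image": 50.0, "both": 42.5}
question_b = {"language": 35.0, "image": 22.5, "both": 27.5, "neither": 12.5}


def modality_totals(answers):
    """Add the 'both' share to the image share and to the language share."""
    both = answers.get("both", 0.0)
    return {
        "image": answers["image"] + both,
        "language": answers["language"] + both,
    }


totals_a = modality_totals(question_a)  # attracting attention
totals_b = modality_totals(question_b)  # conveying the message
```

The image total is higher than the language total for question A., and the order is reversed for question B., which is the mismatch summarised above.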
The interplay between image and textual elements is even clearer in the answers to the free-response question “What does the video advertise and at what point during watching the video did you realize this?” Of these three scenes the first one was seen as most important, and was explicitly named by 40% of the subjects, and this is the scene with a complex event structure. According to the informants’ answers, four types of elements stand out as crucial for conveying the main message of the video. The first one we label Action, referring to any mention of a complex event in the video (the student entering the room, the student calling for help etc.). We label the second element Sound, by which we mean any verbal message in the audio input (the voice of the narrator in the video). The third element we call Simple Image, meaning separate visual elements mentioned by the informants, such as the T-shirt with a university logo or the logo itself. The last type of element is Text appearing on the screen independently of the sound or simple images. This is how these elements are ranked according to their occurrence in the answers to question D:
Simple Image scores best with 47.5%, of which 45% is the T-shirt with the university logo print and 2.5% the university logo on its own. Action follows close behind with 45% (of which 40% are references to the student entering the scene, and 5% to the student calling to get help). 35% identify Sound as an important element, and 12.5% mention Text. Each of these four elements can appear in the answers either alone or in interaction with other elements. All interactions are between elements in the scene where the student is introduced for the first time (scene a) or between elements in that scene and the end text where the main intention of the advertisement is introduced explicitly. Most of the informants who mentioned that seeing the end text was important for them also said that the T-shirt gave them a hint, later consolidated by the end text.
There are two types of visual elements in the experimental video: non-text images (the characters, various objects in the scene, colours, lighting, etc.) and text (subtitles, end text). Non-text images we treat in the standard way as images, while the subtitles are treated as text, and thereby as language. In this respect there are two elements that are difficult to categorize: the student’s T-shirt and the university logo. The T-shirt is mentioned exclusively in connection with the letters NTNU (the university acronym) printed across its front. Here the T-shirt is important only as a medium for this text, and should be treated on the text side rather than the image side, possibly with an intermediary role between the two visual domains.
A further feature that makes analysing the data more difficult is the dynamic nature of most scenes, which contain complex actions with a complex event structure (Zacks & Tversky, 2019: 73). This also becomes evident in the responses to the questionnaire, in that subjects often mention the combined effect of action and other elements (e.g. sound, a simple image, text). In the ad-analysis tradition, more often than not, the focus is on simple still images, which makes a comparison between the two basic types of elements, text and image, more homogeneous and at the same level. We leave further analysis of complex actions (events) to future work.
As already mentioned, visual information plays an important role in attracting attention, particularly what we refer to as Simple Images. Actions, however, appear more important in conveying the message, most often in conjunction with language (the combination of text and sound). This was clearly supported both by the eye-tracking data and by the responses in the questionnaire. Of the two, language is the more important in interpreting the ad and its intentions. With regard to written text (e.g., the subtitles), we are aware that the eye-tracking data alone cannot be conclusive, due to the well-known automatic nature of reading (as evident, e.g., in the Stroop effect). Thus subjects may have fixated on the subtitles simply because they could not help it.
However, there are two facts that speak against this hypothesis. Firstly, subjects explicitly state in their responses that they were influenced by what they read, i.e. they actually processed the contents of the text. Secondly, fixations on the text in the subtitles are by no means on all words. Rather, as already shown in the eye-tracking data, subjects look more at content words, and not even all content words, but words that are semantically crucial to the message.
Thus, the words that were looked at most were besatt (possessed), redde (save), religionshistorikeren (the religion historian), as well as the end-line NTNU søker nye studenter (NTNU is recruiting new students). This attention pattern conforms largely with what is known independently from language acquisition, early language production and language processing (Field, 2021: 39), in that content words are more prominent in all of these processes. In this respect our study presents an interesting and somewhat unexpected finding.
Along with fixating on the noun religionshistorikeren, subjects also fixated on the pre-posed definite article den. This is the pattern for the rest of the noun phrases in the video: there are systematic fixations first on the article (definite or indefinite) and later on the head noun. To the extent that the article is a function word, this comes as a surprise. However, this only happens with the pre-posed article, and never with the enclitic article. In recent syntactic analyses of nominal expressions (Dimitrova-Vulchanova & Giusti, 2023: 119; Giusti, 2023: 81) it has been suggested that determiners of this type (pre-posed articles and demonstratives) define the outer boundaries of nominal expressions, and as such, appear to be crucial in processing as well. Clearly these findings call for a dedicated investigation of the processing of nominal expressions.
The eye-tracking data revealed that our viewers direct a considerable amount of attention to faces. This is not surprising given the importance of faces in everyday life. On a day-to-day basis we see many faces and amongst these we need to recognise family and friends. Faces are an important transmitter of nonverbal communication. They express the emotional state of the people around us (angry, happy, or, indeed, possessed) and they are important in identifying other characteristics of the face's owner (e.g. age, gender, attractiveness, and health). It is known that very young infants already show a preference for facial features (Goren et al., 2018: 539; Ellis, 2023: 108; Nelson, 2023: 16). In visual search the presence of a face among the distractors makes finding a target more difficult (Langton et al., 2007: 18; Hershler & Hochstein, 2018: 326; Hershler & Hochstein, 2023: 108; Rullen, 2019: 326).
These and other findings have led some to suggest that there is even a special faculty in the brain dedicated to face recognition, one that is also thought to be involved in prosopagnosia, a specific impairment in face recognition (Kanwisher et al., 2018: 137). However, this view is not uncontroversial, and some argue that the proposed brain region is general-purpose and used for other specialised recognition as well (Gauthier et al., 2019: 578). Whatever the nature of the mechanism behind the attractiveness of faces, it is clear that faces attract a good deal of attention, and not only in our study. For instance, usability studies with eye-tracking have found that faces can be used to attract attention to the content of ads and websites, but can also pull attention away from it (Riegelsberger et al., 2018: 316).
In addition, we have isolated an important element in the format of dynamic videos displaying a complex event structure, namely complex images, which we here provisionally label Action. This element clearly calls for special attention and a more rigorous analysis. Further, the relationship and interaction between Simple Image and Text as instances of visual information can be elaborated in future research.
References
1. Dimitrova-Vulchanova M.K. & Giusti G.S. Fragments of Balkan nominal structure. In: Alexiadow A.K. & Wilder C.T. Possessors, predicates and movement in the determiner phrase. – Amsterdam/Philadelphia: John Benjamins, 2023. – P. 118-169.
2. Dyer G.P. Advertising as communication. – London-New York: Routledge, 2019. – 186 p.
3. Ellis H.D. The development of face processing skills // Philosophical Transactions: Biological Sciences. – 2023. – Vol. 35. – P. 105-111.
4. Field J.S. Psycholinguistics. – London & New York: Routledge, 2021. – 389 p.
5. Gauthier I.S., Tarr M.J., Anderson A.W., Skudlarski P.T. & Gore J.C. Activation of the middle fusiform ‘face area’ increases with expertise in recognizing novel objects // Nature Neuroscience. – 2019. – Vol. 2. – P. 568-580.
6. Giusti G.K. Parallels in clausal and nominal periphery. In: Frascarelli M.T. Phases of interpretation. – Berlin: Mouton de Gruyter, 2023. – P. 163-184.
7. Goren C.C., Sarty M.S. & Wu P.K. Visual following and pattern discrimination of face-like stimuli by newborn infants // Pediatrics. – 2018. – Vol. 56. – P. 538-549.
8. Hall S.T. & Whannel P.R. The popular arts. – London: Hutchinson, 2021. – 189 p.
9. Hershler O.T. & Hochstein S.K. At first sight: A high-level pop out effect for faces // Vision Research. – 2023. – Vol. 45. – P. 107-124.
10. Hershler O.T. & Hochstein S.K. With a careful look: Still no low-level confound to face pop-out // Vision Research. – 2018. – Vol. 46. – P. 318-326.
11. Hutchby I.P. Conversation and technology. – Cambridge: Polity Press, 2019. – 376 p.
12. Kanwisher N.G., McDermott J.T. & Chun M.M. The fusiform face area: A module in human extrastriate cortex specialized for face perception // Journal of Neuroscience. – 2018. – Vol. 11. – P. 130-138.
13. Langton S.R., Law S.A., Burton A.M. & Schweinberger S.R. Attention capture by faces // Cognition. – 2007. – № 8. – P. 16-27.
14. Nelson C.A. The development and neural bases of face recognition // Infant and Child Development. – 2023. – Vol. 10. – P. 3-16.
15. Panofsky E.T. Meaning in the visual arts. – Harmondsworth: Penguin, 2018. – 167 p.
16. Riegelsberger J.P., Sasse M.A. & McCarthy J.T. Eye-catcher or blind spot? The effect of photographs of faces on e-commerce sites. – Lisbon, 2018. – 326 p.
17. Rullen R.T. On second glance: Still no pop-out effect for faces // Vision Research. – 2019. – Vol. 46. – P. 317-327.
18. Zacks J.M. & Tversky B.K. Granularity in taxonomy, time and space. – New York, 2019. – 239 p.