Steven Mithen has noted that language and music have three things in common: they may be vocal, they may be gestural, and they can be written down. Vocalisation must be one of the earliest forms taken by music, and no study of music can ignore that it is highly gestural, e.g. people find it hard not to tap along to the rhythm. Writing and musical notation appeared in the early civilisations and are relatively late innovations.
To use the terminology of the neuroscientist Steven Brown, who has made a particular study of music, both music and language involve a limited repertoire of discrete building blocks, organised into phrases and higher-order structures using combinatorial rules. Put simply, both organise individual acoustic elements using a kind of grammar, and these elements can be built into larger structures such as novels and symphonies. For both we also use expressive phrasing, where we modulate acoustic properties such as pitch and rhythm “for the purposes of conveying emphasis, emotional state, and emotive meaning.” In both music and language, certain modes of symbolic expression are the same across all cultures and intensify along with the emotion: happiness is expressed through faster, louder and higher music or speech, sadness by the opposite. Dean Falk, an anthropologist who specialises in the evolution of the brain, tells us that language and music are “neurologically intertwined”, and has concluded that “they began evolving together by two million years ago”.
Although they are similar, music and language are also distinct. Language conveys referential meaning, i.e. a symbolic content based upon the arbitrary association of words with specific referents. Music, on the other hand, has difficulty conveying such meaning — its emphasis is on the emotive, and it is therefore ‘manipulative’, moving our emotions and bodies. This divide is one of emphasis only. Language often uses onomatopoeia to emulate sounds, such as the English “miaow” or “tick tock”. One couldn’t easily devise music to convey the sense, “Please go and fetch the first cup on the left,” but music is not incapable of direct semantic representation. The imitation of a cuckoo in Beethoven’s sixth symphony is a famous example; or, in musical narrative, certain phrases can be associated with certain characters or concepts, as in the operatic use of leitmotif. Both language and music often use gesture, and language would not be so effective without the ‘musical’ changes of pitch, volume etc that add layers of meaning to speech.
Broadly, in language and music we have structurally similar forms of communication that use two different systems: sound reference and sound emotion. It could be that their similarities are a pure accident, or that music grew out of language or vice versa. Brown prefers another option — that “music and language must converge at some deep level to have hierarchical organisation flower from two such different grammatical systems.”
Whereas language has a clear purpose in conveying information, the ‘use value’ of music is less obvious. Steven Pinker described music as “auditory cheesecake, an exquisite confection crafted to tickle the sensitive spots of at least six of our mental faculties”, that is, a by-product of adaptations that evolved for other purposes. A slice of cake pleases our taste buds, not because cake was essential to our evolution, but because our ancestors evolved a liking for fat and sugar as excellent energy sources. Music would thus be what Gould called a ‘spandrel’: an accidental offshoot of more significant processes. This wouldn’t stop music being glorious, but it would mean that it was not an evolutionary adaptation. We shall address the question of adaptive value as we go on.
An ancestral proto-language
Steven Brown explains the similarities between music and language by suggesting they originate in a single, ancient form of vocal communication that he terms ‘musilanguage’: “an ancestral stage that was neither linguistic nor musical but that embodied the shared features of modern-day music and language.” Musilanguage combined sound as emotive meaning and as referential meaning. Only later in human development did music and language separate onto different paths.
This is not a new idea. The Swiss philosopher Jean-Jacques Rousseau, for example, wrote in 1754 that “verse, singing and speech have a common origin... one spoke as much by natural sounds and rhythm as by articulations and words”. Charles Darwin, whose ideas on language and music remain relevant today, suggested it in 1871 in a context of sexual selection:
When we treat of sexual selection we shall see that primeval man, or rather some early progenitor of man, probably first used his voice in producing true musical cadences, that is in singing, as do some of the gibbon-apes at the present day; and we may conclude from a widely-spread analogy, that this power would have been especially exerted during the courtship of the sexes, — would have expressed various emotions, such as love, jealousy, triumph, — and would have served as a challenge to rivals. It is, therefore, probable that the imitation of musical cries by articulate sounds may have given rise to words expressive of various complex emotions.
In the 1920s, the linguist Otto Jespersen also championed the idea. But it had to wait until the last couple of decades to receive consistent academic attention. In The Singing Neanderthals, one of the most prominent recent books on the origins of music, Mithen argues that music and language have common origins in a quasi-musical ‘proto-language’, i.e. a musilanguage (the term I shall keep to for simplicity). Drawing upon archaeology, genetics and linguistics, Mithen follows linguist Alison Wray in proposing that
the precursor to language was a communication system composed of ‘messages’ rather than words; each hominid utterance was uniquely associated with an arbitrary meaning... modern language only evolved when holistic utterances were ‘segmented’ to produce words which could then be composed together to create statements with novel meanings.
This holistic musilanguage would have made “extensive use of variation in pitch, rhythm and melody to communicate information, express emotion and induce emotion in other individuals”. A modern equivalent to how this worked might be proverbial phrases like “don’t count your chickens before they’re hatched”, which we understand in a holistic way rather than according to their individual parts.
Steven Brown listed three essential features of musilanguage: 1. the use of pitch to convey semantic meaning, 2. the creation of phrases by combining “lexical-tonal elements”, and 3. the use of expressive phrasing to give emotional emphasis.
Mithen refers to musilanguage by the (slightly annoying) acronym ‘Hmmmmm’, because it is holistic (made of complete meaningful phrases rather than discrete parts), manipulative (influencing our emotional states and behaviour), multimodal (using both sound and movement), musical (temporal, rhythmic, and melodic), and mimetic (making use of sound symbolism and gesture). Pre-sapiens human species could have possessed such a musilanguage and thus had a musical capacity as well, but true language was probably limited to ourselves, so we may speculate that the separation of musilanguage into two distinct forms of communication was confined to our own species.
Mithen suggests we may hear an echo of musilanguage in the special speech we use for babies, the so-called “infant-directed speech” or “parentese”. This form of speech, which is universal in human cultures, doesn’t rely on proper words, but employs extended vowels, exaggerated pauses, and wider ranges of pitch to allow us to communicate with infants. This works because we are sensitive to the tempos and rhythms of speech long before we have learnt words (and we begin to hear before we even leave the womb). The exaggerated discourse of infant-directed speech probably helps children to pick up how words are formed and sentences constructed. But as it can communicate with infants regardless of the mother tongue of the person using it, it seems to be less about language as such and more about emotive and intentional communication through expressive phrasing. Babies can communicate non-verbally long before they can talk (and the importance of non-verbal sound stays with us into adulthood).
This idea too is not Mithen’s own. It was Ellen Dissanayake who hypothesised that
it is in the evolution of affiliative interactions between mothers and infants — not male competition and adult courtship — that we can discover the origins of the competencies and sensitivities that gave rise to human music.
We should amend her idea by pointing out that this interaction can be practised by mothers and fathers, i.e. childcare is not a uniquely female destiny. Dissanayake’s idea is that human groups found the capacities encouraged in parent-infant interaction to be useful both emotionally and functionally in social rituals. Dean Falk has suggested something similar in her book Finding Our Tongues.
We don’t only talk with our infants — we sing to them too, to soothe them, reassure them of our presence, and to make them smile. The psychologist Sandra Trehub has demonstrated that lullabies, like infant-directed speech, sound remarkably similar across cultures. The universality of baby talk and lullabies, which are used instinctively by adults, strongly implies a genetic aspect to both. Mithen’s comment is insightful: “Those who use facial expressions, gestures and utterances to stimulate and communicate with their babies are effectively moulding the infants’ brains into the appropriate shape to become effective members of human communities.” So musilanguage may have conferred an evolutionary advantage through improved socialisation.
Drumming behaviour in chimpanzees, bonobos and gorillas — on the ground, on their chest, or on objects like trees — may represent a distant connection to our own past through a common ancestor some 7–8 million years ago.
Great apes use barks, hoots, screams and other noises to communicate all sorts of messages — what members of the group are up to, the approach of strangers, competitive displays, etc — and sometimes accompany their vocalisations with vigorous movements like shaking branches or stamping. Such behaviours are manipulative, in the sense of trying to get a response from other apes, rather than referential, i.e. there is nothing that resembles words. They are the possible distant precursors of music, language and dance.
John S. Allen has pointed out:
If we consider the various functions that form the basis of music’s claim to be an adaptation — courtship display, emotional communication, synchronising group behaviour, a mnemonic device for carrying information, and so on — all of these functions are at least as well served by language as by music, which would certainly limit the potential fitness benefits that might accrue with musical expertise. So in order for music to be considered an adaptation in these domains, it would have to have enhanced prelinguistic expression in earlier hominids.
The differences of anatomy between the hominids and other apes as a result of our shift to meat-eating, such as reduced size of the teeth and jaws, allowed us to increase our vocal range. This capacity was also affected by bipedality, which lowered our larynx as well as expanding our brain and nervous system. Increasing group size probably demanded an increase in the quality and quantity of calls. Drawing on Robin Dunbar’s research on the development of language through vocal instead of physical grooming, Mithen speculates that early hominids ‘sang’ to each other, reinforcing their social relationships:
One might have heard predator alarm calls; calls relating to food availability and requests for help with butchery; mother-infant communications; the sounds of pairs and small groups maintaining their social bonds by communicating with melodic calls; and the vocalisations of individuals expressing particular emotions and seeking to induce them in others.
Mithen then imagines the whole group, at the close of the day, engaged in group song.
Musilanguage therefore would have arisen among early hominids and died out with the Neanderthals. It is worth stressing that true music is a uniquely human activity. Tracing music’s lineage back to great ape communication does not mean that there is anything musical about that communication, only that it provided a foundation for a human behaviour. Some animal species are capable of very complex vocalisations which to us sound musical (the obvious example being songbirds). But we must be cautious about making lazy, anthropocentric connections. Apes’ capacity for vocal learning is much poorer than our own. Humans are separated from chimps, our nearest surviving ape relatives, by six million years of evolution, and from birds, some of whose vocal learning is considerably better than chimps’, by far longer again. It is possible to teach a degree of symbolic behaviour to apes, but only under controlled conditions. Animals miss the vital ingredient of human consciousness, and so their communication does not contain the rich meanings of human signs, nor do they have the creative ability to break out of instinctive patterns. Recognising a phylogenetic aspect to music is one thing; assuming a gradualist development from animal behaviours that ignores the uniqueness of humans is another.
Mithen finds another possible source of musical behaviours, for which he consults the work of the evolutionary psychologist Merlin Donald. Early human species may have possessed a culture intermediate between that of apes and that of modern humans, which Donald called “mimetic”. Mimesis, or imitation, would in this context have been deliberate representational behaviour that didn’t use words, and it may have played an important role in communicating encounters with new animals and environments as humans spread out of their traditional African homelands.
The very earliest attempts at what became music may have been people observing and trying to imitate the sounds of the natural world — as practised by surviving hunter-gatherer societies — or of human labour, such as the rhythmic knapping of flints. These imitations, predominantly created using the voice, could then be repeated, structured, combined, and added to social rituals such as attaining adulthood, burial and so on. Through imitation, early humans could “create new types of tools, colonise new landscapes, use fire and engage in big-game hunting” (Mithen).
Such mimetic behaviour would have operated as holistic, self-contained messages. Hence the title of Mithen’s book — the Neanderthals, although not possessing true language, would have had a rich repertoire of holistic phrases, and therefore some measure of musicality.
Once humans had the ability to vocally imitate the sounds of nature and their own activities, and accompany these vocal signs with gestural ones, we had the seeds of language. This would often have been onomatopoeic: a buzzing noise intended to imitate a bee, for example, could become a word which meant ‘bee’. Another means of language development is ‘sound synaesthesia’, a more precise term for ‘sound symbolism’, wherein the sound not only mimics the animal’s own calls, but tries to capture something of its nature. An example is the use of long sounds like ‘oh’ and ‘ah’ to represent a large, heavy animal. Such sounds could be accompanied by gestures, or mime. This process proceeded in a dialectical spiral, the physiological and lexical assisting one another.
Mimesis is one of the most persuasive theories for the beginnings of music, potent whether or not one subscribes to the existence of an ancestral musilanguage.
From musilanguage to language and music
One of the challenges for proponents of musilanguage is to explain how it made the leap to modern music and language, and when in human history that took place.
According to Brown’s model, the division arose because musilanguage had two aspects that gradually became separate specialities. The first, symbolic communication emphasising referential meaning, became language. The other, emphasising emotional communication, became music.
Explaining how individual words, or sounds, came to be organised into phrases — i.e. grammar, both lexical and musical — and into extended structures is difficult. Proponents of proto-languages, such as Wray and Arbib, have suggested that holistic phrases could have been gradually broken up into elements of meaning, giving us sentences and then nouns, verbs etc. This could have begun with the recognition of chance associations between particular bits of holistic phrasing and their referents: if a particular phoneme appeared in two phrases which both made some reference to a deer, that phoneme could be held up as an arbitrary label for ‘deer’. Instead of developing individual words and only later a grammar to stick them together with, early humans slowly broke down ‘sung’ phrases into discrete parcels with a precise meaning. The isolation of individual musical notes, together with ways of stitching them together with a musical grammar, could have followed the same course. Freed by language of the need to relay information, musilanguage could become music, concentrating upon communicating emotion and reinforcing group identity.
When precisely this split happened is an open question, but it must have been complete by the time of the flowering of art 50,000 years ago. Going by the archaeological evidence of actual musical instruments, true music, despite its beginnings among earlier hominids, matures for certain only with Homo sapiens.
“Discrete words, that can be combined to make new and unique utterances,” Mithen concludes, “were a relatively late development in the evolutionary process that led to language.”
Music emerged from the remnants of ‘Hmmmmm’ after language evolved. Compositional, referential language took over the role of information exchange so completely that ‘Hmmmmm’ became a communication system almost entirely concerned with the expression of emotion and the forging of group identities, tasks at which language is relatively ineffective. Indeed, having been relieved of the need to transmit and manipulate information, ‘Hmmmmm’ could specialise in these roles and was free to evolve into the communication system that we now call music. As the language-using modern humans were able to invent complex instruments, the capabilities of the human body became extended and elaborated, providing a host of new possibilities for musical sound.
These possibilities were indeed immense, because music evolved as a form of communication. How this communication is made and how it is used can take an infinite number of forms: from handclapping in the playground to an orchestra trumpeting the destiny of humankind.
Mithen sharply disagrees with Pinker’s assessment of music as a non-adaptive ‘extra’. Emotions, he argues, are deeply rooted in our evolutionary past and in our physiology. Why would music affect them so strongly if it was a recent and superficial innovation? In his view the development of music from musilanguage did have adaptive value, enabling communication with infants and group bonding. These behaviours were firmly established with Homo sapiens before our migration from Africa, which explains their universality today.
Mithen’s book tends to make speculative claims without adequate evidence. He says for example that the development of language meant a certain loss of musicality, but presents no strong evidence of this. If modern musical practice often diverges from traditional collective and participatory models, this is not the fault of language but of capitalism, which has commodified, individualised and technologised music-making, dividing it up into categories and specialisations. In many non-Western, and indeed Western, contexts, human beings continue to participate in music spontaneously and collectively much as our ancestors did. His conclusion that the Neanderthals must have been at some level more musical than ourselves seems completely unprovable.
Another shortcoming of Mithen’s book is his commitment to modularity in the brain. Mithen himself notes that the way music processing is distributed through the brain shows that there cannot be a single ‘music module’, so he then has to suggest an increasing number of different modules dealing with different aspects of music. Brain scans have shown that the neural processing of music takes place in many different areas. Mithen acknowledges this, suggesting that modularity does not have to be limited to discrete areas, but he is unable to prove the location or existence of any modules.
These shortcomings don’t disprove the theory of musilanguage, which remains one of the most serious attempts to solve the mystery of music’s origins. We will consider some of the other adaptive theories in the next post.
 Steven Brown, ‘The Musilanguage Model of Language Evolution’, from Wallin, Merker and Brown (eds), The Origins of Music (2000).
 Dean Falk, ‘Hominid Brain Evolution and the Origin of Music’, from Wallin et al, op. cit.
 Steven Pinker, How the Mind Works (1997).
 Rousseau, Essay On the Origin of Languages (written 1754, pub. posthumously in 1781).
 Darwin, Chapter 3 of The Descent of Man, and Selection in Relation to Sex (1871). For a discussion of Darwin’s ideas see W. Tecumseh Fitch, ‘Musical protolanguage: Darwin’s theory of language evolution revisited’ from Language Log (2009).
 Steven Mithen, The Singing Neanderthals (2005).
 Ellen Dissanayake, ‘Antecedents of the temporal arts in early mother-infant interaction’, from Wallin, Merker and Brown (eds), The Origins of Music (2000).
 John S. Allen, The Lives of the Brain: Human Evolution and the Organ of Mind (2009).
 Mithen, op. cit.
 Though of course musical elements live on in language, via holistic idioms, infant-directed speech, and the persistence of onomatopoeia, sound synaesthesia etc.
 Mithen, op. cit.
 To be fair it is one of the benefits of popular science writing, as opposed to strictly scientific papers, that one may engage in creative speculation of this kind.