We use thousands of words every day, with meanings of all kinds and belonging to very varied grammatical categories. However, not all of them are used with the same frequency. Depending on how important they are to the structure of the sentence, there are words that are more recurring than others.
Zipf’s law is a postulate that takes this phenomenon into account and specifies how likely a word is to be used based on its position in the ranking of the total number of words used in a language. Below we will go into more detail about this law.
Zipf’s law
George Kingsley Zipf (1902–1950) was an American linguist, born in Freeport, Illinois, who encountered a curious phenomenon in his studies of comparative philology. While carrying out statistical analyses, he found that the most used words seemed to follow a pattern of appearance, and this was the origin of the law that bears his surname.
According to Zipf’s law, in the vast majority of cases, if not always, the words used in a written text or in an oral conversation follow this pattern: the most used word, which occupies first place in the ranking, appears twice as often as the second most used, three times as often as the third, four times as often as the fourth, and so on.
In mathematical terms, this law would be:
$P_n \approx \dfrac{1}{n^{a}}$
where $P_n$ is the frequency of the word that occupies rank $n$ and the exponent $a$ is approximately 1.
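To make the formula concrete, here is a minimal sketch in Python. The function name `zipf_expected` and the figure of 1,200 occurrences are hypothetical, chosen only for illustration; the sketch scales the formula by the frequency of the top-ranked word so that the result is a count rather than a proportion.

```python
def zipf_expected(top_frequency, rank, a=1.0):
    """Expected frequency of the word at `rank` under Zipf's law.

    `top_frequency` is the observed frequency of the rank-1 word and
    `a` is the exponent, assumed here to be 1 unless stated otherwise.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_frequency / rank ** a


# Hypothetical corpus in which the most frequent word appears 1,200 times.
for rank in range(1, 6):
    print(rank, zipf_expected(1200, rank))
```

With an exponent of exactly 1, the second word is expected 600 times, the third 400 times, and the fourth 300 times, which is the "half, a third, a quarter" pattern described above.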
It should be said that George Zipf was not the only one to observe this regularity in the frequency of the most used words of many languages, both natural and artificial. In fact, there is evidence that others had noticed it before him, such as the stenographer Jean-Baptiste Estoup and the physicist Felix Auerbach.
Zipf studied this phenomenon with texts in English and, apparently, it holds true. If we take the original version of Charles Darwin’s On the Origin of Species (1859), we see that the most used word in the first chapter is “the”, with about 1,050 occurrences, while the second is “and”, appearing about 400 times, and the third is “to”, appearing about 300 times. The fit is not exact, but the second word appears roughly half as often as the first and the third roughly a third as often.
The same thing happens in Spanish. Taking the original Spanish version of this article as an example, the word for “of” is the most used, appearing 85 times, while the word for “the”, the second most used, appears 57 times.
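A count like this one can be reproduced for any text with a few lines of Python. The sketch below is only an illustration: the function name `rank_words` and the sample sentence are hypothetical, and the tokenizer is deliberately simple (lowercase ASCII letters and apostrophes), so it is suitable for English text only.

```python
import re
from collections import Counter


def rank_words(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Simplifying assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical sample sentence; a real test would load a whole chapter or book.
sample = "the cat and the dog and the bird saw the fox"
for rank, word, count in rank_words(sample)[:3]:
    print(rank, word, count)
```

A sample this short is far too small to show the law reliably; it only shows how the ranking is built.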
Seeing that this phenomenon occurs in other languages, it becomes interesting to think about how the human brain processes language. Although there are many cultural phenomena that mediate the use and meaning of many words, the language in question being a cultural factor in itself, the way in which we make use of the most used words seems to be a factor independent of culture.
Frequency of function words
Let’s look at the following ten words, given here in English translation: ‘that’, ‘of’, ‘not’, ‘a’, ‘the’ (twice, since Spanish has a masculine and a feminine form of the article), ‘is’, ‘and’, ‘in’ and ‘it’. What do they all have in common? They are words without meaning on their own but, ironically, they are the 10 most used words in the Spanish language.
By saying that they have no meaning we mean that, if a phrase is said in which there is no noun, adjective, verb or adverb, the phrase has no meaning. For example:
… and … … in … … a … of … … to … of … …
On the other hand, if we replace the dots with words with meaning, we can have a phrase like the following.
Miguel and Ana have a little brown table next to their bed at home.
These widely used words are known as function words, and they are responsible for giving grammatical structure to the sentence. They are not just the 10 that we have seen: in fact there are dozens of them, and all of them are among the hundred most used words in Spanish.
Although they have no meaning on their own, they are impossible to omit from any sentence that is meant to make sense. In order to transmit a message efficiently, human beings need to use words that make up the structure of the sentence. For this reason they are, curiously, the most used.
Research
Despite what George Zipf observed in his studies of comparative philology, until relatively recently it had not been possible to test the postulates of the law empirically. This was not because it was materially impossible to analyze all the conversations or texts in English, or any other language, but because of the titanic effort it involved.
Fortunately, thanks to modern computing and computer programs, it has been possible to investigate whether this law holds in the form in which Zipf originally proposed it or whether there are variations.
One case is the research carried out by the Center for Mathematical Research (CRM, in Catalan Centre de Recerca Matemàtica), linked to the Autonomous University of Barcelona. Researchers Álvaro Corral, Isabel Moreno-Sánchez and Francesc Font-Clos carried out a large-scale analysis of thousands of digitized texts in English to see how well Zipf’s law held.
Their work, in which an extensive corpus of nearly 30,000 volumes was analyzed, allowed them to obtain a law equivalent to Zipf’s, in which the most used word was used twice as often as the second, and so on.
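As an illustration of the kind of check that computers make feasible, the sketch below estimates the exponent of the law from a list of word frequencies by fitting a straight line to log(frequency) against log(rank). This is a simple textbook approach and an assumption made for this example, not the method used by the CRM researchers, which is not described here. The function name `estimate_exponent` and the synthetic data are hypothetical.

```python
import math


def estimate_exponent(frequencies):
    """Estimate the Zipf exponent by least squares on a log-log scale.

    Frequencies are sorted in descending order so that position equals rank.
    All frequencies must be positive and at least two are required.
    """
    if len(frequencies) < 2 or min(frequencies) <= 0:
        raise ValueError("need at least two positive frequencies")
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative; the exponent is its absolute value.
    return -covariance / variance


# Synthetic frequencies that follow 1/n exactly, so the estimate should be close to 1.
ideal = [1200 / rank for rank in range(1, 51)]
print(round(estimate_exponent(ideal), 3))
```

On real word counts the estimate will depend on how many ranks are included, so a result near 1 should be read as a rough indication rather than as a rigorous test.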
Zipf’s law in other contexts
Although Zipf’s law was originally used to explain the frequency of words used in each language, comparing their rank with their actual frequency in texts and conversations, it has also been extrapolated to other situations.
A quite striking case is the number of people living in the largest cities of the United States. According to Zipf’s law, the most populous American city should have twice as many inhabitants as the second most populous, and three times as many as the third.
If you look at the 2010 census, this roughly holds. New York had a total population of 8,175,133, followed by Los Angeles with 3,792,621, and then by Chicago, Houston and Philadelphia with 2,695,598, 2,100,263 and 1,526,006, respectively.
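The comparison can be made explicit with a short Python sketch that sets each observed population against the value Zipf’s law would predict, taken here as the largest population divided by the rank. The populations are the census figures quoted above; the function name `compare_with_zipf` is hypothetical and an exponent of exactly 1 is assumed.

```python
populations = {
    "New York": 8_175_133,
    "Los Angeles": 3_792_621,
    "Chicago": 2_695_598,
    "Houston": 2_100_263,
    "Philadelphia": 1_526_006,
}


def compare_with_zipf(values):
    """Return (rank, name, observed, predicted) rows.

    `predicted` is the largest value divided by the rank (integer division),
    which is what Zipf's law gives when the exponent is exactly 1.
    """
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    return [
        (rank, name, observed, largest // rank)
        for rank, (name, observed) in enumerate(ordered, start=1)
    ]


for rank, name, observed, predicted in compare_with_zipf(populations):
    print(f"{rank}. {name}: observed {observed:,}, predicted {predicted:,}")
```

For these five cities, each observed figure lies within about 10% of the predicted one, which is what "roughly holds" means in this context.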
This can also be seen in the most populated cities in Spain. Zipf’s law is not completely fulfilled there, but population does correspond, to a greater or lesser extent, with the position each city occupies in the ranking. Madrid, with a population of 3,266,126, is twice as large as Barcelona, with 1,636,762, while Valencia, with about 800,000 inhabitants, has roughly a quarter of Madrid’s population, somewhat less than the third the law would predict.
Another observable case of Zipf’s law involves web pages. Cyberspace is very extensive, with nearly 15 billion web pages created. Taking into account that there are about 6.8 billion people in the world, in theory there would be about two web pages for each of them to visit every day, which is not the case.
The ten most visited pages currently are: Google (60.49 million monthly visits), YouTube (24.31 million), Facebook (19.98 million), Baidu (9.77 million), Wikipedia (4.69 million), Twitter (3.92 million), Yahoo (3.74 million), Pornhub (3.36 million), Instagram (3.21 million) and Xvideos (3.19 million). Looking at these numbers, you can see that Google receives more than twice as many visits as YouTube, about three times as many as Facebook, more than four times as many as Baidu…