There are a few equations I put in this post using a new trick I learned recently. Sometimes, the equations look really tiny. If that happens to you, just hit the refresh button, and they should be big again. The issue may just be my browser.
One of the most interesting arguments I've come across concerning the mathematical obstacles to evolution says that the probability of getting just one functional protein of average length in the entire history of the universe, even given unrealistically generous probabilistic resources, is so vanishingly small that it's not reasonable to believe new proteins could form through undirected natural processes.
One particularly good presentation of an argument like this is "The Statistical Case Against Evolution" by Paul Scott Pruett. He notes that there are examples of convergent evolution, not just on the macro scale, but on the scale of genes and proteins as well. That means nature seems to aim at particular target proteins. The odds of getting particular target proteins are much smaller than the odds of getting just any functional protein, yet nature seems to produce the same target proteins over and over.
I also saw this video on YouTube, but based on what Scott told me, it's a little sketchy. Scott thought some of his assumptions were either too generous or just arbitrary. I have issues with this presentation, too, but it's at least easy to understand.
I have been skeptical of this argument for a number of reasons, but I thought it might be interesting for me to try to work through the line of reasoning myself and see what I come up with. While trying to work through it, I came up against some unanswered questions that prevented me from completing the argument. I've put this blog post on hold and revisited it from time to time over the last few years, but now I thought I'd just make a post about where I'm at. Maybe if I post about my unanswered questions and why I think they are relevant, somebody will have something to say about them.
So, here we go.
An amino acid is an organic molecule made of hydrogen, carbon, oxygen, and nitrogen. There are 20 different kinds of amino acids that make up the proteins in all life on earth. Proteins are strings of amino acids. Depending on the length of these sequences and their order, proteins can be folded into stable shapes. The shapes of the proteins are what give them their function. You can think of them like car parts.
The amino acids can be strung together in any order. You can think of them like letters in the alphabet. You can string a bunch of letters together in any order. Just as some of those arrangements will produce gibberish while other arrangements will produce coherent words and sentences, so also some sequences of amino acids can be folded into stable functional proteins, and others cannot.
Proteins come in different lengths, but the average protein is about 200 amino acids long. Since each position along the string could contain any of 20 different amino acids, there are 20^200 possible sequences in a protein that's 200 amino acids long.
If you were to randomly pick out a sequence of amino acids 200 units long, the odds that you would get any one particular sequence would be 1 in 20^200, which is pretty small. However, there are more things to consider in this argument, which brings me to some of my unanswered questions.
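To get a feel for how big that number is, here is a minimal Python sketch. It relies on Python's built-in arbitrary-size integers, and the names `ALPHABET_SIZE`, `LENGTH`, and `count_sequences` are just labels picked for this illustration.

```python
ALPHABET_SIZE = 20   # kinds of amino acids
LENGTH = 200         # the average protein length assumed in the text

def count_sequences(alphabet_size, length):
    """Number of distinct strings of `length` symbols from the alphabet."""
    return alphabet_size ** length

total = count_sequences(ALPHABET_SIZE, LENGTH)
digits = len(str(total))

print(f"20^200 has {digits} digits")
print(f"20^200 is roughly {total:.3e}")
```

In scientific notation, that works out to roughly 1.6 x 10^260.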
Any low probability can be overcome if you have enough chances for it to happen. If you were trying to guess the combination of a lock, you may have a 1 in a million chance of making the right guess on the first try, but if you tried a million times, you'd have a good chance of guessing correctly in at least one of those tries.
Let's suppose you want to aim for a particular sequence of amino acids 200 units long. You start on day one at the beginning of the universe, and you do one try per second continuously until today. There's no need to be precise, so roughly. . .
13.8 billion years x 365 days/year x 24 hours/day x 60 minutes/hour x 60 seconds/minute = 4.35 x 10^17 seconds
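The same arithmetic as a short Python sketch, in case you want to plug in a different age. It ignores leap years, just like the rough estimate above, and the constant and function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignores leap years, like the estimate above
AGE_OF_UNIVERSE_YEARS = 13.8e9

def seconds_in(years):
    """Convert a span of years to seconds using a 365-day year."""
    return years * SECONDS_PER_YEAR

age_seconds = seconds_in(AGE_OF_UNIVERSE_YEARS)
print(f"{age_seconds:.2e} seconds")
```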
With that being the case, what are the chances of getting the correct sequence if you made one attempt every second for 13.8 billion years?
Let me make a detour here and clarify something I used to be confused about.
If you have a six-sided die and you roll it, the odds of getting any given number are 1 in 6, right? So you'd think that if you rolled it six times, the odds of getting any given number would be 1. In other words, you'd be guaranteed to get the right number. But that obviously isn't right because it's possible to roll it six times and never get a 2. So here's the correct way to figure out the probability of getting a 2 if you roll it six times.
Each time you roll the die, you have a 5 in 6 chance of not getting a 2. So if you roll it six times, the probability of not getting a 2 on any of the six rolls can be given by,
\[
\normalsize
\left(\frac{5}{6}\right)^6
\]
Since that's the probability of not getting a 2 in the six rolls, you can subtract that number from 1 to get the probability that you will get a 2.
\[
\normalsize
1 - \left(\frac{5}{6}\right)^6
\]
That comes out to a 66.5% chance, or about 1 in 1.5, which is obviously lower than a 100% chance.
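Here is the dice calculation as a small Python sketch. It uses the standard library's `fractions.Fraction` so the answer comes out exact before being turned into a decimal. The function name `prob_at_least_one` is just a label chosen for this example.

```python
from fractions import Fraction

def prob_at_least_one(p_single, tries):
    """Chance of at least one success in `tries` independent attempts."""
    return 1 - (1 - p_single) ** tries

# Chance of rolling at least one 2 in six rolls of a fair die.
p_two = prob_at_least_one(Fraction(1, 6), 6)

print(p_two)          # the exact fraction
print(float(p_two))   # the same value as a decimal
```

The same function works for any single-try probability and any number of tries, as long as the tries are independent.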
Now, let's apply that same reasoning to figure out what the chances are of getting our target protein. Since the chance of getting the right sequence in one try is 1 in 20^200, the chance of not getting the right sequence is,
\[
\normalsize
\frac{20^{200} - 1}{20^{200}}
\]
And the chance of getting the right sequence in 4.35 x 10^17 tries is,
\[
\normalsize
1 - \left(\frac{20^{200} - 1}{20^{200}}\right)^{4.35 \times 10^{17}}
\]
Unfortunately, my dinky calculator can't handle those kinds of numbers, but if you think about it, the probability is really small. That means if you were to make one random attempt every second for 13.8 billion years to get a particular sequence of amino acids 200 units long, there's almost no chance that it would happen. But I would love to see the actual number.
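For what it's worth, this is the kind of number a few lines of Python can handle. Below is a minimal sketch, assuming ordinary double-precision floats are accurate enough for a rough answer. A naive `1 - (1 - p) ** n` would just print 0, because `1 - p` rounds to exactly 1 when p is this small, so the sketch leans on `math.log1p` and `math.expm1` from the standard library instead. The function name is my own label.

```python
import math

def prob_at_least_one(p_single, tries):
    """1 - (1 - p)^n, computed in a way that survives a tiny p.

    (1 - p)^n equals exp(n * log(1 - p)). log1p and expm1 keep the tiny
    terms that plain floating-point subtraction would round away.
    """
    return -math.expm1(tries * math.log1p(-p_single))

p_single = 20.0 ** -200   # chance of one specific 200-unit sequence
tries = 4.35e17           # one try per second for 13.8 billion years

p_hit = prob_at_least_one(p_single, tries)
print(f"{p_hit:.2e}")
```

Under those assumptions the answer lands somewhere around 2.7 x 10^-243, which is the number of tries divided by 20^200. That shortcut works whenever the number of tries is tiny compared to the number of possibilities.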
We can improve these odds, though. We know that in reality, there could be tries happening simultaneously all over the universe each second (relativity of simultaneity notwithstanding). Let's imagine some really generous probabilistic resources to improve our odds.
The internet estimates that there are 10^50 atoms in the earth. Let's suppose all these atoms are just hydrogen, carbon, oxygen, and nitrogen, that they are currently part of amino acid molecules, and that they are all joining in the effort to make our target protein. The average amino acid contains 10 atoms, so there would be about 10^49 amino acids that make up the earth.
\[
\normalsize
\frac{10^{50} \, \text{atoms}}{10 \, \text{atoms/amino acid}} = 10^{49} \, \text{amino acids}
\]
The internet also estimates that there are 200 billion trillion stars in the universe. That's 200,000,000,000 x 1,000,000,000,000 = 2 x 10^23 stars. Let's imagine there are two earth-like planets for each star, and they have all existed for the entirety of the 13.8 billion years of the universe. That's 4 x 10^23 earth-like planets, all trying to make this one protein.
In that case, there would be 4 x 10^72 amino acids available to make proteins.
\[
\normalsize
10^{49} \, \text{amino acids/planet} \cdot 4 \times 10^{23} \, \text{planets}= 4 \times 10^{72} \, \text{amino acids}
\]
Since each try uses up 200 amino acids, there are 2 x 10^70 tries going on each second.
\[
\normalsize
\frac{4 \times 10^{72} \, \text{amino acids}}{200 \, \text{amino acids/try}} = 2 \times 10^{70} \, \text{tries}
\]
Since we already figured out that there are 4.35 x 10^17 seconds in 13.8 billion years, that means there are 8.7 x 10^87 tries in the history of the universe.
\[
\normalsize
2 \times 10^{70} \, \text{tries/sec} \cdot 4.35 \times 10^{17} \, \text{sec} = 8.7 \times 10^{87} \, \text{tries}
\]
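The chain of estimates above can be bundled into one small Python sketch, which makes it easy to swap in different guesses. Every input is one of the rough figures from the text, and the function and parameter names are hypothetical labels for this illustration.

```python
import math

def total_tries(atoms_per_planet, atoms_per_amino_acid, stars,
                planets_per_star, protein_length, seconds):
    """Total tries if every amino acid on every planet joins in, one try per second."""
    amino_acids_per_planet = atoms_per_planet / atoms_per_amino_acid
    planets = stars * planets_per_star
    tries_per_second = amino_acids_per_planet * planets / protein_length
    return tries_per_second * seconds

tries = total_tries(
    atoms_per_planet=1e50,      # atoms in the earth, per the text
    atoms_per_amino_acid=10,    # rough average used in the text
    stars=2e23,
    planets_per_star=2,
    protein_length=200,
    seconds=4.35e17,            # 13.8 billion years
)
print(f"{tries:.1e} tries")
```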
Now we can adjust the original probability we got to account for all these generous probabilistic resources. Now, we get,
\[
\normalsize
1 - \left(\frac{20^{200} - 1}{20^{200}}\right)^{8.7 \times 10^{87}}
\]
That's an improvement, and although my dinky calculator can't give you the actual number, you should be able to tell that it's still an extremely small number. Look at that fraction. The numerator and denominator are almost exactly the same because if you subtract 1 from a number as big as 20^200, you haven't subtracted much, relatively speaking. That means the fraction is extremely close to 1, which means that number raised to 10^87, though smaller, is still going to be very close to 1. And that means 1 minus that number is going to be very close to zero. And that means there's nearly a zero percent chance of getting the target protein.
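To put a rough number on "very close to zero," here is a sketch that works with base-10 logarithms so nothing overflows or rounds away. It assumes the shortcut that when the number of tries n is far smaller than the number of possibilities N, the chance of at least one hit is approximately n/N. The function name is a made-up label.

```python
import math

def log10_prob_at_least_one(log10_tries, log10_outcomes):
    """log10 of n/N, which approximates 1 - (1 - 1/N)^n when n is much smaller than N."""
    return log10_tries - log10_outcomes

log10_tries = math.log10(8.7e87)        # tries in the history of the universe
log10_outcomes = 200 * math.log10(20)   # log10 of 20^200

log10_p = log10_prob_at_least_one(log10_tries, log10_outcomes)
exponent = math.floor(log10_p)
mantissa = 10 ** (log10_p - exponent)

print(f"about {mantissa:.1f} x 10^{exponent}")
```

Under that approximation the probability comes out on the order of 10^-173, so even these generous resources barely dent the odds for one specific sequence.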
Up until now, we have only been trying to calculate the odds of getting one specific sequence of amino acids 200 units long. But if we are just trying to find out what the odds are of getting any functional protein given the same probabilistic resources, our odds should greatly improve. The reason is that for any sequence of amino acids of some given length, there is more than one sequence that could be functional.
There are two things necessary for a protein to be functional. First, it needs to be able to fold up into a particular shape and hold that shape. Second, it needs to exist in an environment in which it serves a purpose. I'm going to ignore that second requirement for the sake of this thought experiment because that would complicate things. Whether a protein serves a purpose depends on the shape of every other protein in its environment. For the purposes of this thought experiment, I'm going to assume that any protein that can fold up into a stable shape has the potential to be functional. I just want to know what the odds are of getting any potentially functional protein in the history of the universe given our generous probabilistic resources.
It is at this point in the game that I have run up against a wall. To continue the thought experiment, I need to know, out of all the 20^200 possible sequences of amino acids in my protein, what fraction of them are capable of folding up into a stable shape.
We know already that you can alter a few of the amino acids in a sequence and still end up with the same functional protein. If that weren't the case, we'd all be genetically identical. It is our genes that store the information to build our proteins. Two people can have the same gene that codes for the same protein, but there will be slight differences between them. Those differences are what make us genetically unique. It's why DNA evidence is useful in criminal investigations. It's also why 23andMe can find your relatives. The closer the relation, the more similar the DNA sequence.
Besides variations in the same protein, you can have completely different proteins (i.e. proteins that fold into a different shape and perform a different function) that are the same length, or close to the same length.
If all we looked at were the proteins that exist in nature, almost all of them are functional. Otherwise, nature wouldn't have preserved them. So we can't just look at the existing proteins to estimate how many sequences in 20^200 could be functional. I've seen people make that mistake.
It would be great if we could build that many and just see for ourselves what fraction of them fold into stable shapes. But 20^200 is too many, and they're not easy to make anyway. Another way is to use computer simulations. We could just have a computer predict how they would fold.
Predicting how a sequence of amino acids will fold up has been a notoriously difficult problem for a while now. Veritasium recently posted a video about it you should check out. Mithuna Yoganathan at the Looking Glass Universe channel also made a video about it a while back. The good news is that it looks like, thanks to AI, the notorious protein folding problem has been solved. AI can now predict, with 90% accuracy, how a given string of amino acids 30 units long will fold up. Until this breakthrough came along, I don't see how anybody could possibly know what fraction of proteins of a given length could fold up into stable shapes. Now, it looks like it's possible to figure it out.
How would they do it, though? One way would be to try every sequence. There's probably not enough computing power for that, though. It might work if you were only considering sequences 10 or 20 units long, but if you try 200 units long, no computer has that kind of power.
Another way is to try a representative sample size and extrapolate. Maybe they could try a million random sequences to see what fraction of them fold up into stable shapes. Then they could try another million and see if they get the same fraction. If they do, then they can extrapolate to the whole 20^200 possibilities and estimate the fraction of them that can make functional proteins. Will somebody out there please try this? I would love to know.
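The sampling idea might look something like the Python sketch below. To be clear about what is made up: `toy_folds_stably` is a HYPOTHETICAL placeholder with no biological validity. It only exists so the sampler has something to count, and a real attempt would swap in an actual fold predictor. The batch size is also shrunk from a million to 10,000 so it runs quickly.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
HYDROPHOBIC = set("AILMFVW")          # used only by the toy placeholder below

def random_sequence(length, rng):
    """Draw a uniformly random amino acid sequence of the given length."""
    return "".join(rng.choices(AMINO_ACIDS, k=length))

def toy_folds_stably(sequence):
    """HYPOTHETICAL stand-in for a real fold predictor (not real biology)."""
    hydrophobic = sum(1 for aa in sequence if aa in HYDROPHOBIC)
    return hydrophobic / len(sequence) >= 0.5

def estimate_fraction(length, samples, seed, predicate=toy_folds_stably):
    """Fraction of random sequences for which predicate(sequence) is true."""
    rng = random.Random(seed)
    hits = sum(predicate(random_sequence(length, rng)) for _ in range(samples))
    return hits / samples

# Two independent batches; if they roughly agree, the estimate is stable.
first_batch = estimate_fraction(length=30, samples=10_000, seed=1)
second_batch = estimate_fraction(length=30, samples=10_000, seed=2)
print(first_batch, second_batch)
```

One caveat with this approach: if the true fraction is extremely small, a million random samples would most likely turn up zero hits. In that case the sampling could only say the fraction is below some upper bound, not what it actually is.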
I was recently reading Stephen Meyer's book, Return of the God Hypothesis, and I was relieved to see that Meyer addressed this issue I was having. This excerpt gave me hope that I was at least thinking it through correctly. He said,
Nevertheless, when I first met Denton, he told me that it was not yet possible to make a conclusive mathematical determination of the plausibility of a random mutational search for new functional genes and proteins. Molecular biologists, he told me, could not yet quantify how rare functional DNA sequences (genes) and proteins were among all the possible sequences of nucleotide bases and amino acids of a given length. Consequently, they couldn't yet calculate the relevant probabilities - and thus assess the plausibility of random mutation and natural selection as a means of producing new genetic information.
This looks to be on page 309 or 310, but I'm using a Kindle, so I can't be sure. Anywho, when I read that recently, I was all like, "That's what I've been saying!" A few pages later, he repeated basically the same thing. He said,
They also need to know how rare or common functional arrangements of DNA are among all the possible arrangements for a protein of a given length. That's because for genes and proteins, unlike in our bike-lock example, there are many functional combinations of bases and amino acids (as opposed to just one) among the vast number of total combinations. Thus, they need to know the overall ratio of functional to nonfunctional sequences in the DNA.
That's on page 312, I think. A few pages later, Meyer said he met Douglas Axe who had tried to answer this question. Axe determined that functional proteins are extremely rare. Meyer writes,
How rare are they? Axe set out to answer this question using a sampling technique called site-directed mutagenesis. His experiments revealed that, for every one DNA sequence that generates a short functional protein fold of just 150 amino acids in length, there are 10^77 nonfunctional combinations - combinations that will not form a stable three-dimensional protein fold capable of performing a specific biological function.
If I'm reading that right, it would mean only 1 in 10^77 sequences 150 amino acids long are functional. How many is that? We can figure that out with a ratio.
\[
\normalsize
\frac{x}{20^{150}} = \frac{1}{10^{77}}
\]
So,
\[
\normalsize
x = \frac{20^{150}}{10^{77}} \approx \frac{1.43 \times 10^{195}}{10^{77}} = 1.43 \times 10^{118}
\]
The probability of not getting a functional sequence in 1 try would be,
\[
\normalsize
\frac{20^{150} - 1.43 \times 10^{118}}{20^{150}}
\]
And the odds of getting a functional sequence in 8.7 x 10^87 tries are,
\[
\normalsize
1 - \left(\frac{20^{150} - 1.43 \times 10^{118}}{20^{150}}\right)^{8.7 \times 10^{87}}
\]
Will somebody out there with a fancy schmancy calculator please calculate that and leave a comment with the answer? We can probably simplify it with an approximation. This should give close to the same result:
\[
\normalsize
1 - \left(1 - \frac{1}{10^{77}}\right)^{10^{88}}
\]
That looks to me like it would give a result close to 1, meaning nearly a 100% chance. I wonder what would happen if I assumed more realistic probabilistic resources.
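Here is one way to check that hunch without a fancy calculator. This is a minimal Python sketch that, instead of computing the probability directly, computes the base-10 logarithm of the chance that every single try fails. It uses `math.log1p` from the standard library so the tiny 1 in 10^77 doesn't get rounded away. The function name is my own label.

```python
import math

def log10_prob_no_success(p_single, tries):
    """log10 of (1 - p)^n, the chance that every single try fails."""
    return tries * math.log1p(-p_single) / math.log(10)

p_functional = 1e-77   # Axe's ratio as quoted by Meyer
tries = 1e88           # the rounded number of tries from the approximation above

log10_miss = log10_prob_no_success(p_functional, tries)
print(f"chance of never hitting: about 10^({log10_miss:.3g})")
```

The exponent comes out around negative 43 billion, so the chance of missing every time is indistinguishable from zero, and the chance of at least one hit is indistinguishable from 100%.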
After playing around on my calculator, using more manageable numbers, I noticed that if the outer exponent (e.g. the 10^88 in the above equation) is much higher than the number in the denominator (e.g. the 10^77 in the above equation), the probability is close to 100%, and if it's much lower, the probability is close to 0%. It's only when they are close to each other that you get a probability in the 20 to 80% range. Let me see what happens if I take Douglas Axe's word for the 1 in 10^77 figure and use more reasonable probabilistic resources.
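There's a standard approximation that seems to explain this pattern. It isn't from Meyer or Axe; it's just the usual limit definition of e. Writing N for the number in the denominator and n for the outer exponent, when N is large,
\[
\normalsize
1 - \left(1 - \frac{1}{N}\right)^{n} \approx 1 - e^{-n/N}
\]
So only the ratio n/N matters. The probability lands between 20% and 80% when n/N is between roughly 0.22 and 1.6, since
\[
\normalsize
-\ln(0.8) \approx 0.22 \qquad \text{and} \qquad -\ln(0.2) \approx 1.61
\]
Outside that narrow window, the result quickly heads toward 0% or 100%.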
Let's keep the assumption of 2 x 10^23 stars in the observable universe. Not all stars are going to have habitable planets because, for example, red dwarf stars are more active and are likely hostile to life. About 70 to 80% of stars are red dwarfs. I also suspect that stars living near the centers of galaxies aren't as conducive to life. A generous but more realistic estimate for the number of habitable star systems, then, would be half of the total stars, so let's go with that: 1 x 10^23. That's a simpler number anyway.
Let's assume all these star systems have 1 planet or moon with amino acids, temperatures, and other conditions capable of supporting life. Now we have 1 x 10^23 planets.
Red dwarfs live longer than medium sized or ginormous stars, but we've eliminated most of them. The more massive a star is, the shorter its life, so the less time there is for life to emerge. Since our star is medium sized, and since they say life has maybe 1 billion years left, let's assume the average planet has 5 billion years in which to produce a functional protein. So instead of calculating the number of seconds in 13.8 billion years, we're going to use the number of seconds in 5 billion years. That's 1.57 x 10^17 seconds.
Let's stick with 1 try per second, but this time, we're not going to assume each planet is nothing but amino acids. We'll still make a generous assumption, though. Let's assume 1/4 the mass of earth's oceans is made of amino acids. According to the internet, there are estimated to be 4.64 x 10^43 water molecules in earth's oceans. A water molecule is made up of 3 atoms, so that's 13.92 x 10^43 atoms. We're taking 1/4 of that, so that's 3.48 x 10^43 atoms. The average amino acid is made up of 10 atoms, so there are 3.48 x 10^42 amino acids. Our target protein this time is 150 amino acids long because we're using Axe's number. So there are 2.32 x 10^40 proteins on each planet at any given moment.
Now we can calculate the number of tries.
\[
\normalsize
1 \frac{\text{try}}{\text{sec}} \cdot 2.32 \times 10^{40} \frac{\text{proteins}}{\text{planet}} \cdot 1 \times 10^{23} \, \text{planets} \cdot 1.57 \times 10^{17} \text{seconds}
\]
\[
\normalsize
= 3.6 \times 10^{80} \, \text{protein tries}
\]
Our new probability is,
\[
\normalsize
1 - \left(1 - \frac{1}{10^{77}}\right)^{3.6 \times 10^{80}}
\]
It looks like with the more realistic assumptions, albeit still generous, we still have a probability near 100% that at least one functional protein will be created somewhere in the observable universe. Of course in reality, you need thousands of proteins for life, and you need them all on the same planet, so maybe, just maybe, things will turn out to be unlikely after all.
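The more realistic scenario can be run through a short Python sketch, too. All the inputs are the figures quoted above, including Axe's 1 in 10^77. The one thing added here is an assumption of mine: it treats the number of hits as roughly Poisson, which is a standard approximation when there are many tries with a tiny chance each. The constant names are hypothetical labels.

```python
import math

WATER_MOLECULES_IN_OCEANS = 4.64e43   # figure quoted in the text
ATOMS_PER_WATER_MOLECULE = 3
OCEAN_FRACTION_AMINO = 0.25           # the generous 1/4 assumption
ATOMS_PER_AMINO_ACID = 10
PROTEIN_LENGTH = 150                  # matches Axe's 150-unit fold
PLANETS = 1e23
SECONDS = 1.57e17                     # about 5 billion years
P_FUNCTIONAL = 1e-77                  # Axe's ratio as quoted by Meyer

proteins_per_planet = (WATER_MOLECULES_IN_OCEANS * ATOMS_PER_WATER_MOLECULE
                       * OCEAN_FRACTION_AMINO / ATOMS_PER_AMINO_ACID / PROTEIN_LENGTH)
total_tries = proteins_per_planet * PLANETS * SECONDS   # one try per second each

expected_hits = total_tries * P_FUNCTIONAL              # average number of functional hits
expected_hits_per_planet = expected_hits / PLANETS
p_at_least_one = -math.expm1(-expected_hits)            # Poisson approximation (my assumption)

print(f"total tries:            {total_tries:.2e}")
print(f"expected hits overall:  {expected_hits:.0f}")
print(f"expected hits/planet:   {expected_hits_per_planet:.1e}")
print(f"P(at least one hit):    {p_at_least_one}")
```

Under these assumptions the expected number of functional hits across the whole universe is in the thousands, but the expected number on any one planet is a tiny fraction of one. That lines up with the worry about needing all the proteins on the same planet.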
I'm not totally convinced by Meyer's argument, even if the probability is small, because I don't really understand how Axe came up with this number, and I don't know whether his number is accepted by the community of geneticists and biologists out there. I don't know if there's any controversy about it or whether it's a widely accepted estimate.
I'm a little skeptical for the reasons I explained earlier--the fact that until recently, there was no way to predict how proteins would fold just from knowing the sequence. Whatever method Axe used, it seems like the method I suggested earlier would probably work better. I feel arrogant saying that given how little I know and understand, and I don't mean to sound that way. I'm just expressing what makes sense to me.
There's another issue that's relevant to this whole conversation, and that's how new genes/proteins are formed. There are lots of ways they can come about, and the way they come about should have some bearing on their probabilities. If you were just creating a fresh protein from scratch, the probability of getting a functional sequence would be far less than if you took two already existing functional proteins and spliced them together. Since we already know that each half folds into a stable shape, it's not that unlikely that the combination will also fold into a stable shape.
There are other ways to create new proteins, too. One way, is to cut one in half. Another way is to insert a sequence in an already existing protein. You could even insert a sequence that existed in a different functional protein. You could take a functional protein and delete a section, and it would probably result in a different functional protein. So there are all kinds of ways to get new functional proteins from old ones, and those don't strike me as being nearly as improbable as creating one from scratch.
However, according to what I've read, there are genes and proteins that do emerge seemingly from scratch. They call them de novo genes or orphan genes. They don't have any known precursor. Some of these de novo genes might have precursors that are just lost to biological history. It doesn't mean they didn't exist. But some appear to have somehow emerged from what used to be called the "junk" part of the DNA. In a sense, they did emerge from scratch. It seems to me this argument I'm trying to think through would only apply to de novo genes that emerged from scratch or from the "junk" part of DNA, if there is such a thing.
If there's a section of DNA that doesn't code for proteins or that doesn't serve some other purpose, it should be blind to the forces of natural selection. Natural selection tends to preserve useful sequences and gets rid of harmful sequences. But if there's a sequence that is neither useful nor harmful, then for all practical purposes, it's random. If a gene emerges from a random sequence, then that would be an example of a de novo gene. A de novo gene like that can't be built up over time by making small improvements to an already existing functional sequence. These types of genes must feel the full force of the improbability we tried to calculate earlier. These types of genes used to be thought rare, but it turns out they are more common than once thought.
If anybody ever does the experiments I suggested earlier, here are a couple of things I would like to know.
First, I would like for somebody to pick some length to test, using simulations, AI, or whatever, and get an estimate of what fraction of sequences of that length can form functional proteins.
Second, I would like for somebody to do the same thing with a handful of other lengths. They could maybe test lengths of 20 amino acids long, 50, 100, 150, 200, etc. I would be curious to know if the fraction is the same for each length or if it's different. If it's different, I would like to know whether the fraction increases or decreases with length. Maybe you could plot it on a graph. It would be interesting to know if there's a curve to it. Maybe somebody could come up with an equation to describe the curve and discover a new law of biology or something.
That's where I'm at right now. I would love to hear your thoughts on this subject, so leave a comment.
Here's tomorrow's post on the same subject where I asked ChatGPT to pick my estimates.