Philochristos: ChatGPT on protein probabilities

Saturday, January 21, 2023

ChatGPT on protein probabilities

One of my favourite Christian bloggers, Paul Scott Pruett, once made a post on The Statistical Case Against Evolution. I've been skeptical of this argument since he made it, but at the same time I've been fascinated by it. I have wondered whether and to what degree my skepticism has more to do with the limitations of my own understanding when it comes to all things biological or if there's really merit to my criticisms.

A few months ago, I attempted to get to the bottom of it. I decided I was going to try to work my way through the issue myself, look up some facts, do the actual calculations, think my way through it as carefully as I could, and see what happens. I started a blog post about it where I was going to think it through out loud.

I got stuck part way through it, though, because there were some questions I had that I couldn't find the answers to. When Scott (and others) make this argument, they attempt to simplify things by using, as an example, a protein that has a pretty average or conservative number of amino acids. Something around 200 is usually what I've seen in similar arguments. But I wanted to try to calculate the probabilities taking all lengths of possibly functional proteins into consideration. After all, for all I knew, it was possible that the fraction of functional proteins out of all possible sequences could be different for each length, and I had no idea whether or how it might differ.

Anywho, these questions were inspired by that effort. I never got the answers I wanted, but this was my attempt. Plus, I was just having a little fun.

Sam: What fraction of possible chains of amino acids two hundred amino acids long can form functional proteins?

ChatGPT: The fraction of possible chains of amino acids 200 amino acids long that can form functional proteins is extremely small. The number of possible amino acid sequences for a chain of 200 amino acids is 20^200, which is an extremely large number. However, the number of sequences that can form functional proteins is much smaller. The probability of a random sequence of amino acids forming a stable, functional protein is estimated to be on the order of 1 in 10^77 for a 200-residue protein. This is an extremely small probability, indicating that the vast majority of possible amino acid sequences will not form functional proteins.

It is also important to note that even if a sequence of amino acids can form a stable protein, it does not necessarily mean that it will be functional or have a specific activity. For a protein to be functional, its amino acid sequence must be able to fold into a specific three-dimensional structure that allows it to perform its specific function.

It's also important to keep in mind that proteins are not the only type of functional molecules in living organisms, and there are other types of functional molecules, such as RNA, DNA, and other small molecules that perform important functions as well.
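
Side calculation (not part of the conversation with ChatGPT): to get a feel for the two numbers in this answer, here is a minimal Python sketch. It takes the 20^200 count and the 1 in 10^77 figure at face value and works out how many functional 200-residue sequences that would imply. The constant names are mine, and the 10^77 figure is just the one quoted above, not something the code verifies.

```python
import math

ALPHABET = 20          # the 20 common amino acids
LENGTH = 200           # residues in the example chain
FUNCTIONAL_EXP = -77   # log10 of the quoted "1 in 10^77" fraction

# Work in log10 so the numbers stay readable.
log10_total = LENGTH * math.log10(ALPHABET)
log10_functional = log10_total + FUNCTIONAL_EXP

print(f"possible sequences  ~ 10^{log10_total:.1f}")
print(f"implied functional  ~ 10^{log10_functional:.1f}")

# Python integers are exact, so the full count can be checked as well.
exact_total = ALPHABET ** LENGTH
print(f"20^200 has {len(str(exact_total))} decimal digits")
```

If the quoted fraction is right, the functional sequences would be a vanishing share of the total and still an enormous number in absolute terms (around 10^183).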

Sam: What determines whether a protein is functional or not?

ChatGPT: A protein is considered functional if it can perform a specific biological function. The ability of a protein to perform its specific function is determined by its three-dimensional structure, which is determined by its amino acid sequence.

The primary determinant of whether a protein is functional is its ability to fold into a specific three-dimensional structure. The process of protein folding is determined by the properties of the individual amino acids and the interactions between them, such as hydrogen bonds, electrostatic interactions, and hydrophobic interactions.

If a protein is not able to fold into its correct three-dimensional structure, it will not be able to perform its specific function and it's considered as a non-functional protein. The misfolding of proteins can lead to diseases such as Alzheimer's, cystic fibrosis, and some cancers.

Additionally, the chemical and physical properties of the environment also play a role in the folding and function of proteins. For example, the pH, temperature, and presence of other molecules can all affect the stability and function of a protein.

In summary, the function of a protein is determined by its ability to fold into a specific three-dimensional structure that allows it to perform its specific function. This ability is determined by the properties of the individual amino acids and the interactions between them, as well as the chemical and physical properties of the environment.

Sam: What is the shortest possible protein that could be functional?

ChatGPT: The shortest possible functional protein is a matter of ongoing research and debate in the scientific community. While there is no definitive answer to this question, some functional peptides have been identified with as few as five amino acids.

Amino acids are the building blocks of proteins, and the properties and function of a protein are determined by the specific sequence of amino acids that make it up. The specific sequence determines the three-dimensional structure of the protein, which in turn determines its function.

A number of functional peptides have been identified with as few as five amino acids. For example, the anti-inflammatory peptide, Melittin, which is found in bee venom, is composed of 26 amino acids, it's able to target cell membranes and disrupt their function. Another example is the antibiotic peptide, LL-37, which is composed of 37 amino acids and has the ability to target bacterial membranes and disrupt their function.

It should be noted that the examples mentioned are small peptides and not proteins, proteins are much larger and more complex molecules that perform a wide variety of functions in living organisms. Also, many proteins require the presence of other molecules, such as cofactors, to perform their specific functions.

In summary, while the shortest possible functional protein is still a matter of ongoing research, some functional peptides have been identified with as few as five amino acids, but these are not considered proteins. The size and complexity of a protein is determined by the specific functions it performs and the environment in which it operates.

Sam: What is the longest protein we know of?

ChatGPT: The longest known protein is Titin, which is a protein that helps to maintain the structural integrity of muscle. Titin is composed of around 34,350 amino acids and it's found in the muscle fibers of most animals, including humans. The protein gets its name from the titin gene, which is responsible for encoding the protein.

In addition to its length, Titin is also notable for its complex structure, which includes multiple domains, each with a specific function. The protein is composed of different regions, including a series of repeating units called immunoglobulin domains, which are responsible for binding to other proteins and helping to maintain the structural integrity of the muscle. It's important to note that Titin is not the only protein that is very large, there are other examples of large proteins, however, Titin is considered the longest one. The length of a protein is determined by the specific functions it performs and the environment in which it operates.

In summary, the longest known protein is Titin, a muscle protein composed of around 34,350 amino acids. It is important for maintaining the structural integrity of muscle. The length and complexity of a protein is determined by the specific functions it performs and the environment in which it operates.

Sam: If you took every possible sequence of amino acids between 20 amino acids long and 30,000 amino acids long, what fraction of them could be functional?

ChatGPT: The fraction of possible amino acid sequences between 20 and 30,000 amino acids long that can form functional proteins is extremely small. The number of possible amino acid sequences for a chain of 20 amino acids is 20^20, while the number of possible amino acid sequences for a chain of 30,000 amino acids is 20^30,000, which is an extremely large number. However, the number of sequences that can form functional proteins is much smaller.

The probability of a random sequence of amino acids forming a stable, functional protein is estimated to be on the order of 1 in 10^77 for a 200-residue protein. While the probability for a smaller protein would be less, it's still an extremely small probability, indicating that the vast majority of possible amino acid sequences will not form functional proteins. It's also important to keep in mind that the probability of a functional protein depends not only on the number of amino acids but also on the sequence of the amino acids, the environment and the specific function that the protein is supposed to perform.

In summary, the fraction of possible amino acid sequences between 20 and 30,000 amino acids long that can form functional proteins is extremely small, and it's estimated to be on the order of 1 in 10^77 for a 200-residue protein. The probability of a functional protein depends not only on the number of amino acids but also on the sequence, the environment and the specific function that the protein is supposed to perform.

Note from Sam (not part of the conversation with ChatGPT): It seems doubtful to me that 1 in 10⁷⁷ would be correct because earlier in the conversation ChatGPT said 1 in 10⁷⁷ was the fraction of possibly functional proteins among all the possible proteins 200 amino acids long. If he's correct in both cases, it would mean the fraction is the same across all lengths (or at least the average of them) as it is across just that one length, which I'm skeptical of. It's possible, though. If I knew for sure, it would help me with my thought experiment. Chime in in the comments if you have anything to say about it.
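
Side calculation (not part of the conversation with ChatGPT): one way to probe this question is to pool every length from 20 to 30,000 and see what the overall fraction looks like under different guesses about how the per-length fraction behaves. The sketch below does that in log space, since 20^30,000 is far too big for ordinary floating point. Both per-length models are purely hypothetical: one holds the fraction at 1 in 10^77 for every length, the other lets the exponent scale with length so that it equals 77 at 200 residues. The function names are mine.

```python
import math

LOG10_20 = math.log10(20)

def log10_sum(log10_terms):
    """log10 of a sum whose terms are given as log10 values (avoids overflow)."""
    terms = list(log10_terms)
    peak = max(terms)
    return peak + math.log10(sum(10 ** (t - peak) for t in terms))

def pooled_log10_fraction(log10_fraction, shortest=20, longest=30_000):
    """log10 of (functional sequences of all lengths) / (all sequences of all lengths)."""
    lengths = range(shortest, longest + 1)
    log10_functional = log10_sum(log10_fraction(n) + n * LOG10_20 for n in lengths)
    log10_total = log10_sum(n * LOG10_20 for n in lengths)
    return log10_functional - log10_total

# HYPOTHETICAL model 1: the same 1-in-10^77 fraction at every length.
constant = pooled_log10_fraction(lambda n: -77.0)

# HYPOTHETICAL model 2: the exponent grows in proportion to length.
scaled = pooled_log10_fraction(lambda n: -77.0 * n / 200)

print(f"constant model: pooled fraction ~ 10^{constant:.1f}")
print(f"scaled model:   pooled fraction ~ 10^{scaled:.1f}")
```

Because each extra residue multiplies the number of possible sequences by 20, the pooled total is dominated by the longest chains. Under these assumptions the pooled fraction ends up being essentially the fraction for sequences near 30,000 residues, so it would match the 200-residue figure only if the fraction really were about the same at every length.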

Sam: How many amino acids exist on earth?

ChatGPT: There are 20 common naturally occurring amino acids that are found in living organisms on Earth. These amino acids are the building blocks of proteins, which are essential for the structure and function of all living cells. The 20 common amino acids are: Alanine, Arginine, Asparagine, Aspartic acid, Cysteine, Glutamic acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine and Valine.

It's worth noting that while these 20 amino acids are the most common and found in most proteins, there are other, less common amino acids that are also found in living organisms. For example, Selenocysteine, Pyrrolysine and other modified amino acids are found in some organisms. Additionally, non-proteinogenic amino acids, which are not found in proteins, are also found in nature.

In summary, there are 20 common naturally occurring amino acids that are found in living organisms on Earth, these are the building blocks of proteins. But, there are other, less common amino acids that are also found in living organisms.

Sam: Are there any proteins that exist in nature that are not associated with a biological organism?

ChatGPT: Proteins are complex molecules that are essential for the structure and function of all living cells, and therefore are typically associated with biological organisms. However, there are some examples of proteins that exist in nature that are not associated with a biological organism.

One example is prions, which are proteins that can exist in a misfolded state and can cause diseases such as mad cow disease. These misfolded proteins can infect other proteins and cause them to also adopt the misfolded conformation. Prions are not alive, they are not cells and they are not associated with a biological organism.

Another example is protein crystals, which are formed by the orderly arrangement of protein molecules in a crystal lattice. These crystals can be found in mineral veins and in some meteorites. These crystals are not alive and not associated with a biological organism.

Additionally, some proteins can be found in the form of protein fibrils, which are long, thin fibers made up of protein molecules that are arranged in a parallel fashion. These fibrils can be found in various natural materials such as spider silk, hair and wool.

In summary, while proteins are typically associated with biological organisms, there are some examples of proteins that exist in nature that are not associated with a biological organism, such as prions, protein crystals, and protein fibrils.

Sam: How many atoms make up the earth?

ChatGPT: The number of atoms that make up the Earth is an extremely large number, and it is difficult to estimate the exact number. However, it's estimated that the total number of atoms in the Earth is approximately 10^50 atoms. This estimate is based on the average density of the Earth, which is approximately 5.5 grams per cubic centimeter, and the total volume of the Earth, which is approximately 1 trillion cubic kilometers.

It's worth noting that this number is only an approximation and that the actual number of atoms in the Earth could vary depending on various factors such as the composition of the Earth's crust, atmosphere, and core. Additionally, the number of atoms in the Earth is constantly changing due to processes such as erosion, volcanic activity, and the movement of tectonic plates.

In summary, the number of atoms that make up the Earth is an extremely large number, estimated to be approximately 10^50 atoms, but it could vary depending on various factors and it's a constantly changing number.
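
Side calculation (not part of the conversation with ChatGPT): the 10^50 figure can be roughly checked from the density and volume mentioned in the answer. The sketch below is a back-of-the-envelope estimate only. The mean atomic mass of about 25 atomic mass units is my own assumption for a planet made mostly of iron, oxygen, silicon and magnesium, and the result shifts in proportion to that guess.

```python
import math

DENSITY_G_PER_CM3 = 5.5      # mean density quoted in the answer
VOLUME_KM3 = 1.0e12          # "1 trillion cubic kilometers" quoted in the answer
CM3_PER_KM3 = 1.0e15         # (1e5 cm per km) cubed
MEAN_ATOMIC_MASS_U = 25.0    # ASSUMPTION: rough number-weighted average for Earth
AVOGADRO = 6.022e23          # atoms per mole

mass_g = DENSITY_G_PER_CM3 * VOLUME_KM3 * CM3_PER_KM3
moles = mass_g / MEAN_ATOMIC_MASS_U
atoms = moles * AVOGADRO

print(f"mass  ~ {mass_g:.2e} g")
print(f"atoms ~ 10^{math.log10(atoms):.1f}")
```

With these inputs the estimate lands a little above 10^50, so the figure in the answer looks about right as an order of magnitude.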

Sam: If every atom in the earth were part of an amino acid, and there were roughly the same number of each of the 20 common amino acids, what is the probability that they will form a functional protein if they all rearranged themselves once per second for a billion years?

ChatGPT: The probability of all the atoms in the Earth, even if they were all part of amino acids and there were roughly the same number of each of the 20 common amino acids, forming a functional protein if they all rearranged themselves once per second for a billion years is extremely low.

The number of possible amino acid sequences that can be formed from the 20 common amino acids is extremely large, even for a small number of amino acids. The probability of any particular sequence forming a functional protein is extremely small, on the order of 1 in 10^77 for a 200-residue protein.

Even if all atoms in the Earth were part of amino acids and the number of possible sequences were reduced by the same number of each of the 20 common amino acids, the probability of a functional protein forming would still be extremely low. Additionally, the probability of any particular sequence forming a functional protein also depends on the environment, and it's not only the sequence of amino acids that determines the function of a protein but also the three-dimensional structure of the protein, which is determined by the properties of the individual amino acids and the interactions between them. It's also worth noting that even if a functional protein formed, it would be highly unlikely for it to remain functional for an extended period of time, as the environment of the Earth would change, and the protein would be exposed to different conditions that would affect its stability and function.

In summary, the probability of all the atoms in the Earth forming a functional protein if they all rearranged themselves once per second for a billion years is extremely low, even if all atoms were part of amino acids and there were roughly the same number of each of the 20 common amino acids.
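
Side calculation (not part of the conversation with ChatGPT): the answer never actually runs the numbers, so here is a rough sketch of the arithmetic. It rests on several assumptions of mine that are not in the conversation: about 20 atoms per amino acid on average, every amino acid sitting in a 200-residue chain, each rearrangement producing a fresh, independent random sequence, and the 1 in 10^77 figure being right for that length.

```python
import math

ATOMS_IN_EARTH = 1e50         # figure quoted in the conversation
ATOMS_PER_AMINO_ACID = 20     # ASSUMPTION: rough average across the 20 amino acids
CHAIN_LENGTH = 200            # ASSUMPTION: every chain is 200 residues long
P_FUNCTIONAL = 1e-77          # quoted fraction for a 200-residue protein
SECONDS = 1e9 * 365.25 * 24 * 3600   # one shuffle per second for a billion years

chains_per_shuffle = ATOMS_IN_EARTH / ATOMS_PER_AMINO_ACID / CHAIN_LENGTH
trials = chains_per_shuffle * SECONDS
expected_hits = trials * P_FUNCTIONAL

# Probability of at least one hit, treating the trials as independent (Poisson).
p_at_least_one = -math.expm1(-expected_hits)

print(f"chains per shuffle  ~ {chains_per_shuffle:.1e}")
print(f"total trials        ~ {trials:.1e}")
print(f"expected functional ~ {expected_hits:.1e}")
print(f"P(at least one)     ~ {p_at_least_one:.1e}")
print(f"break-even fraction ~ {1 / trials:.1e}")
```

Under these assumptions the total number of tries is on the order of 10^62 to 10^63, so a fraction of 1 in 10^77 leaves the expected number of functional proteins far below one. The break-even line shows how large the fraction would have to be for one success to be expected, which is the number the whole argument hinges on.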

4 comments:

Paul said...: I'm impressed by some of the answers, and your questions are good for fleshing out the boundaries of the discussion.

The first question is important, and the answer was interesting and nuanced. It starts out by affirming that there's only a small subset of possible arrangements that would be "functional." It then seems to present Doug Axe's research finding of 1 chance in 10^77 (https://pubmed.ncbi.nlm.nih.gov/15321723/), although I'm not sure of the size protein(s) he had in mind in his study. It then seeks to make the distinction between two forms of functionality: proteins that can successfully be folded into theoretically functional shapes, and proteins that can perform some actual function in the organism. I think Doug's research may only touch upon the folding question, because the second aspect would probably be impossible to accurately quantify. Even so, it is safe to say that it further aggravates the probability issue.

What these things address are probabilities of getting any kind of functionality at all out of a random amino acid chain. But that's not what evolution needs to do. It benefits only from acquiring specific functionality that is relevant to the organism in its environmental context. For instance, a jungle frog that gains a gene that helps it climb is meaningful, but if it gains a gene that adds an anti-freeze protein in its blood, then that could actually be a detriment and would, at best, not be conserved in the DNA. Therefore, the odds are further inflated by narrowing down the many "functional" proteins to only those that are desirable. But that also is hard to quantify. For these reasons, much of the ID argument must be made by way of generous "for the sake of argument" starting points, but then adding "but it's worse than that" logical qualifiers and filters.

You went in a fair direction with your questions about smaller-than-average proteins that could be functional. The pushback might be that perhaps evolution doesn't always have to go for the big odds-defying arrangements. There are indeed some small proteins, but it should also be remembered that few proteins work in isolation from other proteins upon which they either depend or combine to form a larger functional assembly/system.

In any case, since naturalistic evolution is by nature a game of chance, you should expect to see in biology (as a rule) the smallest proteins capable of getting any given job done. What does it say, then, that the average size of proteins is so large, and that there are some that are enormously larger than average?

Additionally, convergent evolution (at the genetic level) suggests that there are certain ideal target solutions at which evolution must/does arrive. If mutations were the dealing of cards, some would suggest that proteins are just various equally valuable arrangements of cards, with no need of looking in advance for specific winning hands. Convergent evolution is one area that suggests "chance" often does/needs-to arrive at royal flushes.

I think the probability issue is crippling for evolutionary theory, but it is also impossible to turn into a concrete formula to throw at its defenders. I think we know in principle (and experimentally in some cases) that probability roadblocks exist for a theory that purports to be driven by trillions of these chance events over time, but the vagaries in quantifying the problem allow some to inject imagination into the gaps.; 1/22/2023 1:06 PM
Sam Harper said...: Hi Scott! Thanks for commenting. I was curious what you might say. Yeah, I can see how it might be difficult to come up with probabilities. From what I understand, the way a protein folds isn't solely because of the sequence of amino acids it's made of. A lot of them require other things floating around in the cell to help them fold up in the right shape. That's kind of a wild card because it means you can't tell just from the sequence alone whether it's able to fold up into a stable shape.

Another question I have is whether we can predict how a protein might fold up from knowing the sequence if we haven't actually seen it in nature. Is our knowledge about functional vs non-functional sequences based just on observations, or is there a theory behind it that allows us to predict how proteins will fold?

The reason I ask is because if we can only know by observing how they do in fact fold, then I don't see how we could possibly know what fraction of the possible sequences could be functional. It seems to me that the important question is not what fraction are functional, but what fraction could be functional. Unless we have some way of being able to predict the functionality of sequences from some theory, there's just no way to know how many sequences might be functional since there are too many to ever test in the lab.; 1/22/2023 6:32 PM
Sam Harper said...: On second thought, maybe we could make an estimate based on a random sampling. If we could just produce a couple thousand proteins, all around 200 amino acids long, and see how many of them fold up into some stable shape, we could extrapolate to estimate what fraction of all the possible sequences are functional. I wonder if anything like that has ever been done. I suspect not since we're not to the point that we can build proteins like that. Or am I wrong about that?; 1/24/2023 8:26 AM
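
A rough sketch of what the sampling idea in the comment above could and couldn't show, using the standard "zero hits" bound for a binomial sample. The function name is hypothetical, and the sketch assumes every sampled sequence is an independent random draw that can be tested reliably.

```python
import math

def upper_bound_zero_hits(n_samples, confidence=0.95):
    """Largest true fraction still consistent with 0 hits in n_samples draws.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    return -math.expm1(math.log(1 - confidence) / n_samples)

n = 2000  # "a couple thousand proteins"
bound = upper_bound_zero_hits(n)
print(f"0 hits in {n} samples only shows the fraction is below ~{bound:.1e}")

# Samples needed before even one hit is expected at the quoted fraction.
quoted_fraction = 1e-77
print(f"samples needed to expect one hit ~ {1 / quoted_fraction:.0e}")
```

A sample of a couple thousand could tell a fraction like 1 in 100 apart from a fraction like 1 in a million, but on its own it can't get anywhere near a claim like 1 in 10^77.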
Paul said...: Only the smallest proteins have a chance of folding into their optimal form without aid. Larger ones risk making sub-optimal, messy, and incomplete products if left on their own. Additionally, if folding is not done quickly, correctly, and in a controlled way then it risks aggregation with other molecules in the crowded environment of the cell. This is where chaperone proteins and the endoplasmic reticulum (ER) come into play. Some chaperones can work in the cytosol (the space between the organelles), but for more complex folding and multi-subunit protein assembly, the ribosome (that assembles the amino acid chains) will temporarily attach to the ER to inject the polypeptide product for processing by a series of chaperones, which facilitate in a deterministic way the optimal assembly of the protein(s). And if that is not achieved (e.g., due to mutation), then it signals it for recycling.

An additional problem is that most proteins don't just get released into the wild, but need to be packaged and transported to their intended site of operation, either inside or outside the cell. This is carefully coordinated between the ER and the Golgi apparatus. This is both for transport purposes and because even properly folded proteins can be improperly reactive in the wrong location.

None of this negates the idea that the amino acid order of occurrence is very important, and that there are optimal arrangements. It only adds layers of dependencies on top of all this. The problem for evolutionary biology is that everything in need of explanation is a rabbit hole of irreducible complexity. There is no sense in which the more we understand about it all, the easier it is to explain, and we never seem to get to the bottom of any of it.

You are right that there are difficulties in predicting the final structure of proteins, but it is not a hopeless task. The fact that chaperone proteins work in a deterministic way gives us hints as to what steps are involved in arriving at final forms. This article is a good starting point:
https://en.wikipedia.org/wiki/Protein_folding

Proteins can be manufactured, but as I understand it, researchers do so by introducing the DNA required to build the protein into a host organism and letting it do the dirty work for them. They then need to distill (purify) the product for harvesting and use.; 1/28/2023 11:37 AM