Re: protein folding / scientific simulation with CAs Eugene Leitl (ui22204@sunmail.lrz-muenchen.de)
29 Aug 1995 09:32:14 -0400

On 19 xxx -1, kr wrote:

> At 11:24 8/19/95, Eugen Leitl wrote:
> >The reliable prediction of protein tertiary structure, which is
> >a function of the primary sequence, and both context and history
> >(less so) is an absolute prerequisite for any practicable form
> >of protein engineering.
>
> No it is not. This is a common misconception. Please see Drexler's famous
> 1981 PNAS paper (vol 78 p.5275) for the differentiation between solving the
> scientific Protein Folding Problem (PFP) and the engineering style, inverse
> PFP.

Hmm. What does PNAS stand for? Since I'll probably be unable to retrieve it, would you care to elucidate some of the salient points in just a few lines?

> >Even more, since in proteins form follows function we face the
> >inverse problem. Our task is much harder: we have to predict a
> >primary sequence which have to fold robustly into a given shape.
>
> On the contrary. This engineering task is likely much simpler than
> understanding nature's sequences. See above mentioned paper for some
> arguments.

I am not so sure. A friend of mine holds a similar position: that intelligent (alas, with most of the intelligence located in front of the computer ;) modification of existing structures can produce a given space-filling. However, his recent work (he vgrep-searches for protein homologies and the interrelation of primary and tertiary structures) seems to indicate that the clean view of evolutionarily conserved building blocks is quite untrue. I mean, there are blocks, but even a minor modification (point mutation) changes things markedly. (Nature doesn't seem to like rules. Success is all that counts.)

Moreover, the degree of precision needed for complementary surfaces (autoassembly) and their intricate shapes make such design _very_ difficult. If one thinks of the "DNA sequencer - PCR template - transfection - selection - large-scale culture - isolation - crystallization - X-ray structure (or NMR structure)" iteration cycle (ok, random mutagenesis is much faster/better), then the whole idea of protein engineering acquires all the features of a full-grown nightmare..

I mean, what I'd like to have is an (almost) fully automatic protein folding package, solving a protein in minutes to days. Happily, I can program a GA search and then move on to the beer garden ;)

[ snip ]

> >Scenario 2. Time: Mid 90's. Machine: coarse grained parallel
>
> >"The massive statistical match database search/neural network
> >pattern matching has shown the given residue to belong to a
> >certain domain (e.g. alpha helix) with a probability of 0.85.
>
> Presumably, this kind of approach has a lot of room for further improvement.

Don't you see: they make a prediction based on statistical data. In most cases the structure (at least at domain scale) is about right, in some it is absurdly wrong. But by choosing a certain structure (prefolding it) they might already have manoeuvred themselves into a tight corner they'll never find their way out of. Moreover, they still can't fine-fold. Their energy functions are too coarse.

This prefolding business is a double-edged sword, imnho. It might be a valuable tool to reduce the amount of computation for a later refinement (with a different method), but then early stages are really easy. I dunno.

The statistical approach is likely to make good progress in the short-to-medium run. But it will certainly get stuck in the long run. (You saw it here first ;)

> >4. Some Aspects of a Successful Solution
>
> >Obviously, a successful simulation must _follow the folding
> >pathway(s)_. It must mimic all kinetic kinks along the real
> >origami. It must thus take _time_, the history into account.
>
> Following the kinetics is not necessarily the right thing to do, as it
> might just add to the computational load. If there is a way to show that a

But this is exactly the reason why it is of value: it reduces the computational load, since we don't have to sample those areas (the vast majority of the state space) which the protein never reaches for kinetic reasons.

> sequence folds stably into a specified 3D structure, and that there will be
> no "roadblocks" during folding, then that should be sufficient.

Natural proteins have been shaped by evolution to fold robustly. It is not merely the target form which counts, but also a convenient (fast) way to reach it. Even so, and even though several chaperonins are used, a very noticeable fraction of proteins misfolds and has to be destroyed. _And this is nature_.

If one tries to design a protein purely de novo, the result will be a random coil unless we optimise for robust folding kinetics.

And secondly, how do you know there is a single 3D structure (stable attractor, convergent system)? How do you know a priori what the 3D structure looks like?

> >Both the time (steps/increments) and space (coordinates) have to
> >be _quantized_, space even more roughly. This spells (scaled)
> >integers instead of floats.
>
> You can use integers for coordinates of atoms in any case. If you measure
> in units of picometers, a 32 bit non-negative integer will allow a range of
> 4 millimeters, more than enough.

Exactly. You are aware of that; most scientists aren't. They have never even heard of float artefacts (numerics is a black art, no?) and are absolutely astonished when they get absurdly wrong results from a correct algorithm. Well, floats.
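
A minimal Python sketch of the point, under stated assumptions: coordinates are held as unsigned integers counted in picometers, and all names (PM_PER_METER, span_meters, the accumulate_* helpers) are made up for illustration. It shows the range a 32-bit coordinate covers (2^32 pm comes to roughly 4.3 mm, far more than any protein needs) and one classic float artefact that the integer version does not have.

```python
# Assumption: 1 coordinate unit = 1 picometer, stored as an unsigned integer.
PM_PER_METER = 10**12


def span_meters(bits: int) -> float:
    """Length covered by an unsigned `bits`-bit coordinate at 1 pm resolution."""
    return 2**bits / PM_PER_METER


def accumulate_float(step: float, n: int) -> float:
    """Add `step` n times in floating point; rounding errors pile up."""
    total = 0.0
    for _ in range(n):
        total += step
    return total


def accumulate_int(step_pm: int, n: int) -> int:
    """Add `step_pm` n times in integer picometers; the result is exact."""
    total = 0
    for _ in range(n):
        total += step_pm
    return total


if __name__ == "__main__":
    print(span_meters(32))                   # range of a 32-bit coordinate, in meters
    print(accumulate_float(0.1, 10) == 1.0)  # ten steps of 0.1 Angstrom as floats
    print(accumulate_int(10, 10) == 100)     # the same ten steps as 10 pm integers
```

The float comparison fails because 0.1 has no exact binary representation, while the integer sum lands exactly on 100 pm.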

> >The degrees of freedom being low-resolution integers facilitates
> >the use of look-up or hashing tables, possibly intelligent ones
> >with automagical interpolation.
>
> I agree that this should be explored further.

Yes. Especially since most tables contain smooth functions. E.g. linear interpolation will produce very good results at a fraction of the resources and without the use of an FPU.
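
To make this concrete, here is a rough Python sketch of an integer look-up table with linear interpolation. Everything in it is an assumption for illustration: the table spacing STEP_PM, the fixed-point SCALE, and the sample function (a plain exponential decay standing in for some smooth potential term). Floats are touched only once, when the table is built; each look-up is integer arithmetic only.

```python
import math

STEP_PM = 16       # hypothetical table spacing in picometers
SCALE = 1 << 16    # hypothetical fixed-point scale for the stored values


def build_table(f, n_entries):
    """Sample f at multiples of STEP_PM and store the values as scaled integers."""
    return [round(f(i * STEP_PM) * SCALE) for i in range(n_entries)]


def lookup(table, r_pm):
    """Integer-only linear interpolation between the two bracketing entries.

    r_pm must satisfy 0 <= r_pm < (len(table) - 1) * STEP_PM.
    """
    i, frac = divmod(r_pm, STEP_PM)
    lo, hi = table[i], table[i + 1]
    return lo + (hi - lo) * frac // STEP_PM


if __name__ == "__main__":
    def smooth(r):
        # Stand-in smooth function, not a real force field term.
        return math.exp(-r / 200.0)

    table = build_table(smooth, 64)
    worst = max(abs(lookup(table, r) / SCALE - smooth(r)) for r in range(1000))
    print(worst)  # worst-case interpolation error over the sampled range
```

With STEP_PM a power of two, the divmod and the final division reduce to shifts and masks, which is what makes this attractive on a node without an FPU.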

> >_concentrate on strong local ones_ instead. This allows the use
> >of cellular automata (CA)-proven encodings as e.g. the light
> >cone hashing approach (sphere of influence at t being its base
> >and t+1 single residue conformation its tip). (See forthcoming
> >paper on CAs in simulation).
>
> Which papers ? Could you please supply pointers about the light cone
> hashing ? Sounds interesting. Which entity/entities do the hashing ? How
> many hash tables are there ?

It keeps being delayed :( (a lot of them keep being delayed..) Where are we in here, nanotech? I'll post a short version. Look up the HashLife chapter in Moravec's "Mind Children", though it is _very_ en passant.

Anyway, they could speed up the Life runtime by six orders of magnitude on the same machine. Additionally, the coding collapsed the memory requirements, so they were able to run spaces one billion cells on a side. Essentially, it worked because the light cone bases were not equidistributed. Moreover, they noticed that there was a certain fractality at different scales. So they used the normal Life neighbourhood as an index upon the tip vector (central/reference cell), fashioned higher-order indices, etc. I think this was a very big machine (a big Lisp machine?) with lots of RAM and lots of VM.
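
A toy Python sketch of the lowest level of this idea only, not the full HashLife algorithm (no quadtree, no higher-order indices): the 3x3 Life neighbourhood at time t is the base of the light cone and serves as the key into a cache, and the central cell at t+1 is the tip. The function names and the toroidal grid are my own assumptions for illustration.

```python
from functools import lru_cache


def life_rule(alive: int, neighbours: int) -> int:
    """Conway's rule: birth on 3 neighbours, survival on 2 or 3."""
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


@lru_cache(maxsize=None)
def tip(base: tuple) -> int:
    """Map a 3x3 neighbourhood at t (row-major 9-tuple) to its centre cell at t+1.

    There are at most 2**9 = 512 distinct bases, so the cache stays tiny, and
    only the bases that actually occur are ever computed.
    """
    return life_rule(base[4], sum(base) - base[4])


def step(grid):
    """Advance a toroidal grid (list of lists of 0/1) by one generation."""
    h, w = len(grid), len(grid[0])
    return [[tip(tuple(grid[(y + dy) % h][(x + dx) % w]
                       for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)]
            for y in range(h)]


if __name__ == "__main__":
    grid = [[0] * 5 for _ in range(5)]
    for x in (1, 2, 3):
        grid[2][x] = 1          # a horizontal blinker
    for row in step(grid):
        print(row)
    print(tip.cache_info())     # how few distinct bases were actually seen
```

The cache statistics illustrate the non-equidistribution: a typical pattern visits only a small subset of the possible neighbourhoods, and those repeat over and over.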

Anyway, in real systems we have a spherical symmetry (van der Waals, electrostatics), albeit there are anisotropic factors (covalents, shielding). Nevertheless the area of visibility is quite small, so a hash-code might be fashioned. I don't know, yet. It will certainly not be easy. (My current model doesn't use hashing at this scale since my nodes will be small, with no VM).

> >incremental time step (a _sliding neighbourhood window over x,
> >y, z (periodically re)qsorted pointers on particle table_
> >encoding) or as _space slab cluster/voxel space_ encoding.
>
> Could you please elaborate more on these encodings ? This kind of stuff
> might carry a lot of promise.

Certainly. Currently I have abandoned the sorted pointer business in favour of the voxel space; it has much more promise. However, I don't think I can post the hog _here_, do you?
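
For readers who haven't met the idea, here is a generic Python sketch of a voxel-space (cell list) neighbour search. It is an illustration under my own assumptions, not the encoding discussed above: particles carry integer picometer coordinates, the voxel edge VOXEL_PM is a made-up value chosen to be at least the interaction cutoff, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

VOXEL_PM = 1000  # hypothetical voxel edge (10 Angstrom); must be >= the cutoff


def build_voxels(particles):
    """Bin particle indices by voxel; particles are (x, y, z) integer picometers."""
    voxels = defaultdict(list)
    for idx, (x, y, z) in enumerate(particles):
        voxels[(x // VOXEL_PM, y // VOXEL_PM, z // VOXEL_PM)].append(idx)
    return voxels


def neighbours(particles, voxels, idx, cutoff_pm):
    """Sorted indices of particles within cutoff_pm of particle idx.

    Only the 27 voxels around the particle are scanned, so the cost depends on
    local density rather than on the total particle count. Requires
    cutoff_pm <= VOXEL_PM.
    """
    x, y, z = particles[idx]
    vx, vy, vz = x // VOXEL_PM, y // VOXEL_PM, z // VOXEL_PM
    cutoff_sq = cutoff_pm * cutoff_pm
    found = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for j in voxels.get((vx + dx, vy + dy, vz + dz), ()):
            if j == idx:
                continue
            px, py, pz = particles[j]
            # Squared integer distance: no square root, no floats.
            if (px - x) ** 2 + (py - y) ** 2 + (pz - z) ** 2 <= cutoff_sq:
                found.append(j)
    return sorted(found)


if __name__ == "__main__":
    pts = [(0, 0, 0), (300, 0, 0), (999, 999, 999), (5000, 5000, 5000), (1200, 0, 0)]
    vox = build_voxels(pts)
    print(neighbours(pts, vox, 1, 1000))
```

Since interactions are strong and local, a particle only ever needs its own voxel and the 26 adjacent ones, which is where the saving over an all-pairs scan comes from.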

> >5. Plan of Action
>
> >5.5 Etc.
>
> Ambitious plan. :-)

You are telling me =)

Nevertheless a few months will suffice to figure out whether this approach works at all. If it doesn't... then the PFP is not simply a Grand Challenge.

[PNAS = Proceedings of the National Academy of Sciences]