CSB2009 Characterizing the space of interatomic distance distribution functions consistent with solution scattering data

Paritosh A. Kavathekar, Bruce A. Craig, Alan M. Friedman, Christopher Bailey-Kellogg, Devin J. Balkcom*

Department of Computer Science, Dartmouth College, Hanover, NH, USA. devin@cs.dartmouth.edu

Proc LSS Comput Syst Bioinform Conf. August, 2009. Vol. 8, p. 239-250.

*To whom correspondence should be addressed.

Scattering of neutrons and x-rays from molecules in solution offers an alternative approach to studying a wide range of macromolecular structures in their solution state, without the need for crystallization. In this paper, we study one part of the problem of elucidating three-dimensional structure from solution scattering data: determining the distribution of interatomic distances, P(r). This problem is known to be ill-conditioned; for a single observed diffraction pattern, there may be many consistent distance distribution functions. Because of this ill-conditioning, there is a risk of overfitting the observed scattering data. We propose a new approach to avoiding this problem: accepting the validity of multiple alternative P(r) curves rather than seeking a single best one. We show that linear constraints can ensure that a computed P(r) is consistent with the experimental data. The constraints enforce smoothness in the P(r) curve, ensure that the P(r) curve is a probability distribution, and allow for experimental error. We use these constraints to precisely describe the space of all consistent P(r) curves as a polytope of histogram values or Fourier coefficients. Building on this description, we develop a linear programming approach to sampling the space of consistent, realistic P(r) curves. In tests on both experimental and simulated scattering data, our approach efficiently generates ensembles of such curves that display substantial diversity. In particular, we show that the ensemble of P(r) curves generated for a given protein includes members that differ more from a reference curve for that protein than do reference curves for proteins of other structural topologies. Thus, subsequent reconstruction steps must properly account for this P(r) diversity in optimizing structural models.
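
The polytope description above can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal example under stated assumptions: P(r) is a histogram on a uniform grid, intensities are scaled so that I(0) = 1, the forward model is the standard sine transform I(q) = sum_j P(r_j) sin(q r_j)/(q r_j) dr, experimental error is a band of k standard deviations, smoothness is a bound on second differences, and sampling is done by solving linear programs with random objective directions. The names `kernel`, `build_polytope`, `sample_curves`, `k`, and `smooth` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def kernel(q, r):
    """Matrix A with I(q_i) ~ sum_j A[i, j] * P(r_j), assuming I(0) = 1 scaling."""
    dr = r[1] - r[0]
    # np.sinc(x) = sin(pi x) / (pi x), so sinc(q r / pi) = sin(q r) / (q r)
    return dr * np.sinc(np.outer(q, r) / np.pi)


def build_polytope(q, I_obs, sigma, r, k=2.0, smooth=None):
    """Linear constraints A_ub p <= b_ub, A_eq p = b_eq on histogram values p."""
    n = len(r)
    dr = r[1] - r[0]
    A = kernel(q, r)
    # Error band (assumed form): I_obs - k*sigma <= A p <= I_obs + k*sigma
    rows = [A, -A]
    rhs = [I_obs + k * sigma, -(I_obs - k * sigma)]
    if smooth is not None:
        # Smoothness (assumed form): |p[j+2] - 2 p[j+1] + p[j]| <= smooth
        D = np.diff(np.eye(n), n=2, axis=0)
        rows += [D, -D]
        rhs += [np.full(n - 2, smooth), np.full(n - 2, smooth)]
    # Probability distribution: sum_j p_j * dr = 1 (p >= 0 is set via bounds)
    A_eq = np.full((1, n), dr)
    b_eq = np.array([1.0])
    return np.vstack(rows), np.concatenate(rhs), A_eq, b_eq


def sample_curves(q, I_obs, sigma, r, k=2.0, smooth=None, n_samples=10, seed=0):
    """Collect polytope points by minimizing random linear objectives."""
    A_ub, b_ub, A_eq, b_eq = build_polytope(q, I_obs, sigma, r, k, smooth)
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_samples):
        c = rng.standard_normal(len(r))  # random search direction
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        if res.status == 0:  # keep only LPs solved to optimality
            curves.append(res.x)
    return np.array(curves)
```

Each solved program returns a point on the boundary of the polytope, so an ensemble built this way emphasizes extreme consistent curves rather than a uniform sample of the interior; a different sampling scheme may be preferable depending on how the ensemble is used downstream.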
