• StARlite - mobile edition

    Thanks to the nimble-fingered wizard Mark Halling-Brown, now of the Institute of Cancer Research, I now have a working MySQL version of StARlite 31 on my laptop, and therefore a MySQL distro of the database. Of course, I don't have chemical structure query ability (yet), but it is a step forward. It runs very fast indeed, and is very handy for off-line work, ad hoc demonstrations, and analysis. The laptop spec is a MacBook Pro (early 2009) 2.5GHz, 4GB RAM.

    The picture above is something I picked up from the excellent Pixdaus (I think). It captures the exuberance of youth and celebration - I now only have one of these (not youth!); but it is nonetheless a nice picture. If you are in the photo, and want me to remove it, please let me know.

  • StARlite Schema walkthrough web-meeting

    The next web-meeting for a walkthrough of the StARlite schema, data model assumptions and sample queries will be at 11am to noon local UK time (so at this time of year GMT/UT) on Friday 20th March. If you wish to take part in this meeting please use this link (do not modify the header of the email in any way!).

    The last time we tried a web meeting, my domestic broadband connection could not cope with audio and the slides at the same time, so you will need to dial into a UK land line number; unfortunately, this will not be a freephone number.

    Finally, if you can't make this time, we will set up a similar meeting in another few weeks or so.

  • ChEMBL Group Research Retreat 2009

    I have provisionally planned the ChEMBL group retreat for 2009. It will be in Crieff, Scotland, and will be in late September. The format will allow detailed discussion and brainstorming of ideas for the ChEMBL project, and will be themed around the following areas.

      • Open-Source Drug Discovery and Open Science.
      • Patent data-mining and indexing.
      • Auto-curation and in-line predictive model generation.
      • An Ontology for drug discovery screening cascades.
      • A web-services primer.

    The mornings and evenings will be informal discussions of science, while the afternoons will be fun, fungi, flora and photography (who said alliteration is dead!) oriented walks in the wooded areas around Crieff. The picture above is of a reasonably rare parasite of truffles (a Cordyceps sp.) found in Crieff around the same time of year in 2008. This is the fruiting stage of the fungus (the teleomorph); the non-fruiting body stage (the anamorph) of a closely related fungus is the source of the powerful immunosuppressive drug cyclosporin, wow!

    I will try and get some really special VIP guests as surprise speakers and participants! ;)

  • Software - Papers for the iPhone

    Just downloaded and synced up Papers for the iPhone from mekentosj. What a great little app, beautiful interface, very snappy performance, and allows the carrying of a whole bunch of literature in your pocket, and also searching/downloads from your handheld. What more could a hipster mobile scientist want? (Apart from good 3G coverage, 64GB of memory, free Wi-Fi everywhere, and free journal access, of course).

  • Why aren't interfering RNAs used by pathogens?

    The world is full of things that want to kill me, and then without any conscience, eat my conveniently packaged and substantial store of proteins and fats; even when they don't want to kill me completely, they want to use my body as a well-stocked larder and home, sometimes for a long time, and then move on to the body of others. I unconsciously spend substantial time and energy fending off these bacteria, fungi, viruses, helminths, protozoa, insects, and so forth, and huge portions of my highly elaborate biological sub-systems (the innate and acquired immune system, stomach acids, membranes, up to my learnt social hygiene habits, etc.) are assigned to this never-ending battle. When I die, within a matter of hours the hungry hordes will win - this biological battle and arms race has been going on for hundreds/thousands of millions of years.

    Oh yes, there are, in turn, many things that try and protect themselves from my personal need for their fats and proteins, as the inconvenient, distasteful and poisonous nature of many fungi, plants, and the guile and cunning of wild animals show.

    Many pathogens have evolved molecules to suppress my defenses, or to allow better access to my 'resources' (or interfere more generally with normal life, like the wondrous self-healing circulatory blood system), and allow them to enter unobserved, or to maintain long term 'silent running'. These molecules fall into two general classes... (I have been broad in including various classes of bioactives in this list, and not all of these are directly involved in human pathogenic processes, but hopefully you get the point).

      • Proteins - things like hirudin that prevent my blood clotting when a hungry leech wants me - functionally analogous (but structurally distinct and polyphyletic) proteins are found in vampire bat saliva, the blood of ticks, and so on; another example would be a large number of secreted proteins in helminths that suppress the normal process of antigen recognition and clearance, an example here would be Neutrophil Inhibitory Factor from hookworms, or ES-62, again from a nematode.
      • Small molecules - things like statins, made by fungi to kill other fungi, aflatoxins from various fungi, Tacrolimus/FK-506 used as an immunosuppressant drug. These types of molecules are generally known as natural products.
      • DNA/RNA - ?

    Both of these classes of molecules have provided many life-saving and enhancing drug classes, their curative mechanisms often exploiting these evolved toxic or immunosuppressive activities.

    So, given the huge power over biological processes initiated by various forms of interfering RNA, and the triviality of making and encoding them for a pathogen (especially when compared to the very complex set of genes and metabolites required to make natural product toxins), why aren't there huge numbers of diverse RNAi-based mechanisms exploited by pathogens? There certainly is a lot of interest and money going into siRNA drugs. Pathogens, with all their fancy mechanisms for integration and often hijacking of cells, would also seem to be an ideal delivery vector for RNAi weapons. It seems so simple for these organisms to just evolve them, allowing immortalisation, suppression of their growth, etc.

    There are two simple explanations for this - 1) maybe they do exist, we just haven't looked for or found them yet, or 2) maybe they can't because of some systems-level defense that we haven't fully characterised yet. Maybe the Toll-like receptors play this defensive role for the more highly evolved eukaryotes. However, the lack of TLRs in lower animals, plants and so forth, and especially fungi, the venerable masters of long-term, hand-to-hand chemical and biological warfare, may suggest a deeper-embedded system for defense against manipulation by small xeno-RNAs.

    Both 1) and 2) would have profound impact on the likely future success of RNA-based therapies.

    Maybe the answer is obvious to those that know; but it isn't to me, and I would be interested in knowing.

  • Books and Papers - 6 - Software Tools, Kernighan and Plauger

    This was the first proper programming book that I studied, and it is an old one too, written before the web, web-services, and networking - good old-fashioned UNIX programming. It has few words and lots of concepts, and merits revisiting now and then - the prose in Kernighan's books is just excellent, perfectly paced, and combines advice with examples of little code snippets. Probably the best thing about the book (for me) is the use of ratfor - a derivative of Fortran that looks a lot like C, and has some of the best elements of both languages. It is also incredibly quick to code in. I have just downloaded some updated versions of ratfor for my mac, so expect some pretty unusual looking Open Source protein analysis tools any day soon!

    Anyway, although ratfor is used, for almost everyone, the code there will just be pseudocode for implementation and inspiration of code in a more fashionable language. If anything, using a non-current language forces thinking more deeply about the actual structure of the program.

    %A B.W. Kernighan
    %A P.J. Plauger
    %T Software Tools
    %I Addison Wesley
    %D 1976
    %O ISBN 978-0201036695
    

  • Is Drug Discovery Getting Harder?

    Alongside all our scientific interests, we also like to think about some of the financial/business aspects of drug discovery, differing business models, discovery strategies, etc. I guess these fall under the general tag Operational Research. The OHE is also a great place to browse for a very broad range of health economics issues and ideas. Here is a little toy analysis that may be of some interest.

    As some background, probably everyone has seen those time-series graphs comparing discovery costs (from PhRMA, ABPI, etc.) and drug launches. A couple of things spring from this view: firstly, that it is really, really expensive to discover drugs, even with constant currency corrections, and secondly, that the cost per drug launched is inexorably rising.

    Here is something a little bit different….

    We took a simple list of all INNs, and for each of these, there was an associated date - the year in which the INN/USAN was approved. For background an INN/USAN (the ‘generic name’) is granted for a compound in clinical development when the applicant thinks there is a reasonable chance that the compound will be commercialised, i.e. it is a mark that the applicant is serious about the drug. Typically, but certainly not always, an INN/USAN is granted during phase II. This analysis is pretty easy to do; in fact, you can probably come up with this graph yourself with a soupçon of internet tomfoolery, so I won’t bother presenting that here - ‘an exercise for the reader’ as they say.

    What we did next was to map the internal research code to those INNs/USANs. This code is something like LP-12345, where LP is an alphabetic code, by convention assigned to a given company. The number typically indicates the order of registration of the compound in the company's internal compound collection. So, between LP-12,678 and LP-32,129, there would have been 19,451 compounds made and registered. Some companies also use the convention of having a 'dash' then a final integer, and this often is associated with a particular salt (so LP-12,678-1 could be the hydrochloride salt). As an explicit example, Viagra (Tradename) is Sildenafil (USAN/INN) is UK-92480 (Research code). So this would have been the 92,480th compound registered in the compound collection at the UK labs of Pfizer (for more details on research codes and a table of company assignments see the indispensable Merck Index). Remember, we do not have access to the dates the compounds were made, just the link between the USAN date and the research code. The companies that make the compounds of course know the exact day the compound was synthesized and then registered. A further proxy/correlate of the discovery date would be the first time the compound appears in a patent - but this is not useful for a number of reasons that are not relevant for the following discussion.
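
    For anyone who wants to play along, here is a minimal sketch of the research-code arithmetic described above. The regular expression and the function names (parse_research_code, compounds_between) are my own assumptions for illustration; real company conventions vary a lot, so treat it as a starting point rather than a definitive parser.

```python
import re

# Assumed pattern (hypothetical, for illustration): an alphabetic company
# prefix, a dash, a registration number (commas allowed as thousands
# separators), and an optional dash plus integer salt suffix.
RESEARCH_CODE = re.compile(r"^([A-Za-z]+)-(\d[\d,]*)(?:-(\d+))?$")


def parse_research_code(code):
    """Split a research code into (prefix, registration number, salt suffix)."""
    match = RESEARCH_CODE.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognised research code: {code!r}")
    prefix, number, salt = match.groups()
    registration = int(number.replace(",", ""))
    salt_form = int(salt) if salt is not None else None
    return prefix.upper(), registration, salt_form


def compounds_between(first, second):
    """Difference in registration numbers for two codes in the same series."""
    prefix_a, number_a, _ = parse_research_code(first)
    prefix_b, number_b, _ = parse_research_code(second)
    if prefix_a != prefix_b:
        raise ValueError("codes come from different company series")
    return abs(number_b - number_a)


if __name__ == "__main__":
    print(parse_research_code("UK-92480"))
    print(parse_research_code("LP-12,678-1"))
    print(compounds_between("LP-12,678", "LP-32,129"))
```

    The salt suffix is parsed but ignored when counting, since it labels a form of the same parent compound rather than a new registration.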

    So what we can plot pretty easily is the USAN date, and the order of the date of synthesis/registration. This allows us to come up with some pretty solid measures of the number of compounds required to be synthesised/purchased as a function of two key parameters, getting an assigned USAN, and getting an approved drug. What would the graph look like? Do we need to make more compounds per drug output now than we used to (Oh, once more for the halcyon days of drug discovery!) and if there is a larger number of compounds per drug output, are we making a more rational selection from a larger pool, and so reducing downstream attrition? Ideally, one would aspire to making and testing a smaller number of compounds per drug (or more practically, a significantly reduced cost per compound made per drug).
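
    As a rough sketch of the kind of measure meant here, the snippet below estimates registered compounds per USAN before and after a chosen break year. The function names and the example records are invented placeholders, not the data behind the graph, and dividing the registration-number span by the number of gaps between USAN compounds is just one simple assumption about how to count.

```python
def compounds_per_usan(records):
    """Average registration-number gap between successive USAN compounds.

    records: list of (usan_year, registration_number) pairs for one company.
    """
    ordered = sorted(records, key=lambda record: record[1])
    if len(ordered) < 2:
        raise ValueError("need at least two USAN compounds")
    span = ordered[-1][1] - ordered[0][1]
    return span / (len(ordered) - 1)


def split_by_usan_year(records, break_year):
    """Partition records into those before, and from, the break year."""
    early = [record for record in records if record[0] < break_year]
    late = [record for record in records if record[0] >= break_year]
    return early, late


if __name__ == "__main__":
    # Made-up placeholder values, purely to exercise the functions.
    example = [
        (1988, 10_000), (1991, 14_000), (1995, 18_000),
        (1998, 100_000), (2001, 220_000), (2004, 340_000),
    ]
    early, late = split_by_usan_year(example, break_year=1997)
    print(compounds_per_usan(early), compounds_per_usan(late))
```

    Sticking to one company at a time, as with the graph, keeps differences in numbering conventions out of the ratio.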

    As an aside, it would be great to have data on the number of research staff per year per company, but what with mergers, out-sourcing, in-licensing, and so on, I guess even the companies themselves would not have this data for more than the previous five or so years. If anyone does have reliable data on this (and they are free to share it), please contact me.

    For one large multinational company (it does not matter which company it is, really) the graph looks like the picture above. I chose to plot data from one large company because business rules on applying for USANs, compound registration and numbering conventions differ between companies - it removes a bunch of variables. However, every company we have looked at is similar in its general pattern.

    I think this is pretty interesting. The graph is clearly bi-phasic, with a break/inflection in 1997. This reflects a very material increase in the number of compounds per drug/USAN between the two parts of the graph, about 30-fold in fact. This means that to get a USAN, roughly thirty times more compounds were required. Remember there will be an offset between the USAN date and the synthesis date, anywhere between four and six years (typically). So what happened in ~1992 from the discovery side of things that changed the world? One potential interpretation of the data (which, unsurprisingly, I think is almost correct, and can stand quite a bit of challenge) is that molecular biology actually made things a lot worse for the industry. We (as an industry) were suddenly ‘target rich and context poor’ and relied on technology and larger resource levels to solve everything - more compounds, more targets, more screening... It really did make quite a lot of sense, back then. ‘Novelty’ became a very strong, seductive and compelling concept (since it allowed us better patent protection and commercial advantage, it allowed the scientists to do more interesting 'novel' research, and so forth). This ‘target richness’ was actually confounded by the fact that most targets were not actually tractable using the technology we had at that time (largely small molecules), i.e. they were not druggable, and that we may have, in fact, mapped out most of ‘pharmacologically modulatable space’ at a pathway/systems level at that time, but not known it.

    Just before you say 'Ah-Ah' to yourself: another reason for this increase in compounds per unit output is that large-scale expansion of compound collections occurred at just around that time. If that expansion were going to have an impact on productivity, though, it would have had one by now.

    A further way to look at this compound per USAN data is below. Note that the fraction of compounds that make it from USAN stage to launch is roughly constant (about a quarter for this company) for these compound cohorts. This fraction of USAN to launch is an interesting analysis for another time, maybe.

    My train is now arriving at the station, so I need to shut the lid of my laptop, and go. However, there are a number of assumptions in the above chain of argument; I will try and outline the ones we have considered in a later post.

    As always, if anyone is interested in the underlying data, please contact us. (Blue Obelisk Rules!)

  • Part-time data entry work

    We need some people to perform ad hoc data entry/quality control work. There will probably be about 5 hours, or so, a week per person, with payment on an hourly rate. Good understanding of the scientific literature, a methodical manner, excellent ability with technical English, the ability to draw chemical structures, and also high familiarity with Microsoft Excel are essential.

    You must have your own computer, software and internet connection.

    The ability to enter into a contract, and to receive payment in a conventional manner is absolutely essential.

    Thanks for all the interest in this; we now have enough applicants.