Tuesday 29 September 2015

An executable language for change in biological sequences

A discussion on Twitter about whether there was a language for representing sequence edits prompted me to post my draft proposal for such a language. http://figshare.com/articles/Draft_proposal/1559009

Comments, criticism, collaboration and competition welcome. Hopefully I'll submit it shortly.

Saturday 5 September 2015

Notes from workshop on Computational Statistics and Machine Learning

I've just attended "Autonomous Citizens: Algorithms for Tomorrow's Society", a workshop run as part of the Network on Computational Statistics and Machine Learning (NCSML). That's an ambitious title for a workshop! Autonomous citizens are not going to hit the streets any time soon, and the futuristic goals of Artificial Intelligence are still some way off: robots remain clumsy, expensive and inflexible. But AI has changed dramatically since I was a student. Back when computational power was more limited, AI was mostly about hand-coding knowledge into expert systems, grammars, state machines and rule bases. Now almost any form of intelligent behaviour, from Google translation to Facebook face recognition, makes heavy use of computational statistics to infer knowledge.

Posters: there were some really good posters and poster presenters who did a great job of explaining their work to me. In particular I'd like to read more about:
  • A Probabilistic Context-Sensitive Model of Subsequences (Jaroslav Fowkes, Charles Sutton): a method for finding frequent, interesting subsequences. Other methods based on association mining return lots of frequent but uninteresting subsequences; instead, this work defines a generative model and uses the data with EM to infer its parameters.
  • Canonical Correlation Forests (Tom Rainforth, Frank Wood): a replacement for random forests that projects (a bootstrap sample of) the data into a different coordinate space using Canonical Correlation Analysis before building the decision nodes.
  • Algorithmic Design for Big Data (Murray Pollock et al): Retrospective Monte Carlo, i.e. Monte Carlo algorithms with reordered steps. There are stochastic steps and deterministic steps, and the order can have a huge effect on efficiency. His analogy went as follows: imagine you've set a quiz in which each response is either right or wrong. People submit responses and you need to choose a winner. You could first sort all the responses into two piles (correct, wrong) and then pick a winner at random from the correct pile (deterministic first, then stochastic). Or you could just randomly sample from all the responses until you hit a correct one (stochastic first). The second will be quicker; there's a small simulation of this after the list.
  • MAP for Dirichlet Process Mixtures (Alexis Boukouvalas et al): a method for fitting a Dirichlet Process Mixture model. This is useful as a k-means replacement when you don't know in advance what k should be, and when your clusters are not necessarily spherical (there's a quick example with scikit-learn after this list).
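
The quiz analogy from the Retrospective Monte Carlo poster is easy to make concrete. Below is a tiny simulation of my own (not the poster's code); the 10% correct-response rate is made up. It just counts how many responses each ordering has to check before a winner is found.

```python
import random

random.seed(0)

# Simulated quiz responses: True = correct, False = wrong.
# Assume (arbitrarily) that 10% of entrants answered correctly.
responses = [random.random() < 0.1 for _ in range(100_000)]

def deterministic_then_stochastic(responses):
    """Check every response first, then draw a winner from the correct pile."""
    checks = 0
    correct_pile = []
    for r in responses:                    # the expensive deterministic pass
        checks += 1
        if r:
            correct_pile.append(r)
    random.choice(correct_pile)            # then the cheap stochastic step
    return checks

def stochastic_first(responses):
    """Sample responses at random until one turns out to be correct."""
    checks = 0
    while True:
        checks += 1
        if random.choice(responses):       # only check the ones we sample
            return checks

print("checks, deterministic first:", deterministic_then_stochastic(responses))
print("checks, stochastic first:   ", stochastic_first(responses))
```

With only 10% of responses correct, the stochastic-first version checks roughly 10 responses on average instead of all 100,000, which is the kind of gain from reordering that the poster was describing.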
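
And the Dirichlet Process Mixture idea, viewed simply as a k-means replacement where k is unknown, can be tried out with scikit-learn's BayesianGaussianMixture (a variational, truncated DP mixture, not the MAP method from the poster); the data below is synthetic:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Three elongated (non-spherical) clusters; we don't tell the model k = 3.
X = np.vstack([
    rng.multivariate_normal([0, 0],  [[3.0, 0.0], [0.0, 0.2]], 200),
    rng.multivariate_normal([6, 4],  [[0.2, 0.0], [0.0, 3.0]], 200),
    rng.multivariate_normal([-5, 5], [[1.0, 0.8], [0.8, 1.0]], 200),
])

# Truncated Dirichlet process mixture: 10 is only an upper bound on k.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)
used = np.unique(labels)
print("components actually used:", used)          # typically 3 of the 10
print("their mixture weights:", dpgmm.weights_[used].round(3))
```

You give it an upper bound on the number of components (10 here) and it effectively switches off the ones it doesn't need; because each component has a full covariance matrix, the clusters don't have to be spherical.
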
Talks: these were mostly full talks (roughly one hour each), followed at the end by a short introduction to the Alan Turing Institute with a Q&A.

The first talk presented the idea of an Automated Statistician (Zoubin Ghahramani). Throw your time-series data at the automated statistician and it'll give you back a report in natural language (English) explaining the trends and extrapolating a prediction into the future. The idea is really nice. He has defined a language for representing a family of statistical models, a search procedure to find the best combination of models to fit your data, an evaluation method so that it knows when to stop searching, and a procedure to interpret/translate the models and explain the results. His language of models is based on Gaussian processes with a variety of interesting kernels, together with addition and multiplication as operators on models; it also allows change points, so the model can shift from one combination to another at a given timepoint.
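
I don't have his model grammar to hand, but the kernel-composition part is easy to illustrate: in scikit-learn, for example, Gaussian process kernels can be added and multiplied directly, so "smooth trend plus yearly seasonality plus noise" is just a kernel expression. This is my own toy sketch with invented data and hyperparameters, not the Automated Statistician itself, and it has no change points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Toy monthly time series: slow trend + yearly cycle + noise (all synthetic).
X = np.arange(120, dtype=float).reshape(-1, 1)            # months
y = (0.05 * X.ravel()
     + np.sin(2 * np.pi * X.ravel() / 12)
     + 0.2 * np.random.default_rng(0).standard_normal(120))

# Composite kernel: sums and products of simple base kernels,
# in the spirit of a grammar over models.
kernel = (RBF(length_scale=50.0)                                      # long-term trend
          + RBF(length_scale=50.0) * ExpSineSquared(length_scale=1.0,
                                                    periodicity=12.0)  # seasonality
          + WhiteKernel(noise_level=0.05))                             # observation noise

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Extrapolate two years beyond the data, with uncertainty.
X_future = np.arange(120, 144, dtype=float).reshape(-1, 1)
mean, std = gp.predict(X_future, return_std=True)
print(mean[:3].round(2), std[:3].round(2))
```

The search over kernel expressions and the natural-language report are the parts his system adds on top of this kind of building block.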

The next two talks were about robots, which are perhaps the ultimate autonomous citizens. Marc Deisenroth spoke about using reinforcement learning and Bayesian optimisation as two methods for speeding up learning in robots (presented with fun videos showing learning of pendulum swinging, valve control and walking motion). He works on minimising the expected cost of the policy in reinforcement learning. His themes of using Gaussian processes and of using knowledge of uncertainty to decide which new points to sample were also reflected in the next talk, by Jeremy Wyatt, about robots that reason with uncertain and incomplete information. He uses epistemic predicates (know, assumption) and associates probabilities with his robot's rule base so that it can represent uncertainty. If incoming data from sensors may be faulty, then that probability should be part of the decision-making process.
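
To make the "use uncertainty to decide what to sample next" idea concrete, here is a bare-bones Bayesian optimisation loop of my own (nothing to do with his robot controllers): a Gaussian process surrogate is fitted to the evaluations so far, and the next candidate is the point where the lower confidence bound (predicted mean minus twice the predictive standard deviation) is smallest. The cost function is a made-up stand-in for an expensive policy rollout.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_cost(theta):
    """Stand-in for an expensive policy evaluation (e.g. a robot rollout)."""
    return np.sin(3 * theta) + 0.5 * (theta - 0.6) ** 2

candidates = np.linspace(0.0, 2.0, 500).reshape(-1, 1)   # 1-D policy parameter
X = np.array([[0.1], [1.9]])                              # two initial evaluations
y = expected_cost(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    lcb = mean - 2.0 * std               # low mean OR high uncertainty is attractive
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)
    y_next = expected_cost(x_next).ravel()
    X = np.vstack([X, x_next])
    y = np.concatenate([y, y_next])

best = X[np.argmin(y)]
print("best policy parameter found:", best, "cost:", round(float(y.min()), 3))
```

Low predicted cost and high uncertainty both pull the search towards a point, which is why so few (simulated) rollouts are needed.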

Next was Steve Roberts, who described working with crowd-sourced data (from sites such as Zooniverse), real citizens rather than automated ones. He deals with unreliable worker responses and large datasets: people vary in their reliability, and he needs to increase the accuracy of the results while also using people's time effectively. The data to be labelled has a prior probability distribution, and each person also has a confusion matrix describing how they label objects. These confusion matrices can be inspected, and in fact they form clusters representing characteristics of the people (optimist, pessimist, sensible, etc.). There are many potential uses for understanding how people label the data. Along the way, he mentioned that Gibbs sampling is a good method but too slow for his large data, so he uses Variational Bayes, since the approximations work well for this scenario.
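
The per-person confusion matrix idea can be sketched very simply. The version below is a crude, non-Bayesian illustration of my own: estimate the true labels by majority vote, then count how each (invented) worker responds relative to that estimate. In a proper model the confusion matrices and labels would be inferred jointly, with priors, rather than in two fixed steps.

```python
import numpy as np
from collections import Counter

# Invented crowd labels: worker -> {item: label}, for a binary task (0/1).
labels = {
    "optimist":  {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    "pessimist": {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
    "sensible":  {0: 1, 1: 0, 2: 1, 3: 0, 4: 1},
    "sensible2": {0: 1, 1: 0, 2: 1, 3: 1, 4: 1},
}
n_classes = 2

# Step 1: crude estimate of the true labels by majority vote.
def majority_vote(labels):
    votes = {}
    for worker_labels in labels.values():
        for item, lab in worker_labels.items():
            votes.setdefault(item, []).append(lab)
    return {item: Counter(v).most_common(1)[0][0] for item, v in votes.items()}

truth = majority_vote(labels)

# Step 2: per-worker confusion matrix, rows = estimated truth, cols = response.
for worker, worker_labels in labels.items():
    cm = np.zeros((n_classes, n_classes))
    for item, lab in worker_labels.items():
        cm[truth[item], lab] += 1
    cm = cm / cm.sum(axis=1, keepdims=True)   # normalise rows to probabilities
    print(worker, "\n", cm.round(2))
```

The "optimist" puts all its probability on the positive label whatever the true class, the "pessimist" the opposite, which is exactly the kind of behavioural clustering of workers he described.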

Finally, we heard from Howard Covington, who introduced the new Alan Turing Institute, which aims to be the UK's national institute for Data Science. This is brand new and currently has only 4 employees. There will eventually be a new building for the institute in London, opposite the Crick Institute. It's good to see that computer science, maths and stats now have a discipline-specific institute and will gain more visibility from it. However, it's an institute belonging to 5 universities: Oxford, Cambridge, UCL, Edinburgh and Warwick, each of which has contributed £5 million. How the rest of us get to join in with the national institute is not yet clear (Howard Covington was vague: "later"). For now, we can join the scoping workshops that discuss the areas of research relevant to the institute. The website, which has only been up for 4 weeks so far, has a list of these, but no joining information; presumably you email the coordinator of a workshop if you're interested. The institute aims to have 200 staff in London (from professors to PhD students, including administrators). They're looking for research fellows now (Autumn 2015), and PhD students soon. Faculty from the 5 universities will be seconded there for periods of time, paid for by the institute. There will be a formal launch party in November.

Next year, the NCSML workshop will be in Edinburgh.