We’ve been working on a new artificial intelligence engine for several months now. And while programming feedforward networks, recurrent long short-term memory (LSTM) networks, and generative adversarial networks may sound super cool and wonky, the real challenge has been cleaning noisy data. With over 100 billion data points in our database, there are bound to be errors, and figuring out what’s good and what’s bad takes up the majority of our time. So it’s good to know we’re not alone.
As this article notes (the bulk of it is behind a paywall), the quants at Gresham are finding that cleaning noisy data consumes up to 70% of machine-learning labor. That’s pretty close to what we’re experiencing.
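To give a flavor of what that labor looks like, here’s a toy sketch of one common cleaning step: flagging values that sit too far from the rest of a series before they reach a model. This isn’t our actual pipeline (the function name, data, and threshold are all made up for illustration); it’s just a minimal z-score filter.

```python
from statistics import mean, stdev

def drop_outliers(values, z_max=3.0):
    """Keep only values whose z-score magnitude is below z_max."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) / sigma < z_max]

# Hypothetical price series with one obvious bad tick.
prices = [100.1, 99.8, 100.3, 100.0, 5000.0]
clean = drop_outliers(prices, z_max=1.5)
```

In practice, of course, the hard part isn’t writing a filter like this; it’s deciding, point by point, whether an extreme value is a recording error or a real event you can’t afford to throw away.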