2020’s been a crazy year, but even with (or, should we say, because of) all the insanity, data science has had a whirlwind 12 months. The COVID-19 pandemic brought the field into the spotlight as the world attempted to grapple with and understand what was going on and how to quantify it. More importantly, data science became a key way for organizations to cut through the uncertainty in most financial landscapes.
However, more than just responding to COVID, there were several surprising innovations, major news, and some intriguing trends heading into next year. We wanted to know more, so we sat down with Shai Yanovski, Explorium’s VP of Data and Data Science, to see what he thought of the year that was, his biggest surprises, and what he sees for machine learning and data science heading into 2021.
In a year of crazy twists and turns, what was your biggest data science or ML surprise?
I think it was the massive leap in computational power, combined with new, more powerful ML models, and the implications that brings. Microsoft’s Turing Natural Language Generation (NLG) model used an incredible 17 billion parameters, which is intended to reduce the number of examples needed to train a model successfully. The most surprising part about it is that it wasn’t even the most powerful model released this year. OpenAI’s GPT-3 has 175 billion parameters and was trained with 45 terabytes of text data. What really interested me was that these models were trained in an unsupervised manner, and so far, they’ve been really promising. On the other hand, there were some concerns about the potential misuse of these models, raising ethical questions about AI safety and its potential use for misinformation.
What was the most interesting trend in data science this year?
This year we’ve witnessed the establishment of a new class of transformer-based language models, which have made new and very interesting natural language processing (NLP) use cases possible.
Another interesting trend was the expanded use of AI in biology, which became a go-to for various tasks, including genetics, imaging, and chemical discovery. The best-known project was DeepMind’s AlphaFold 2 model, which helped solve a 50-year-old problem for modern biologists around protein folding.
One of the things I think really became important for data science was the automation of data discovery. For data scientists, data acquisition and processing is still a major investment and can really halt a project. Automated data discovery makes this process more efficient by offering more data sources, and sources of better quality. It’s difficult not to imagine using tools like these when it comes to searching through vast amounts of potentially relevant data. Automated data discovery makes data science teams more efficient, helps them deliver value on more use cases, and frees up teams to explore new areas and avenues.
Were there any stories you were following closely this year?
I think I was maybe most interested this year in the ethical implications of all these new powerful deep learning models. GPT-3 particularly had an interesting release. On the one hand, the hype makes it seem like the model is basically a fully functioning AI brain. This is a little misleading since what the model does is complete text prompts. However, it does raise several interesting questions. The first is that of bias, as it can impact both the model’s outputs and even the decisions it makes when completing text.
What really piqued my interest, however, is the potential misuse of models such as GPT-3. What if they’re used to produce spam or misinformation? The AI has already fooled several experts into believing a human produced its content. This year we also witnessed several wrongful arrests involving AI-assisted facial recognition applications. What if AI is used in these unethical ways, and what does that imply for AI and ML going forward as they become more powerful? The AI community will have to address those questions and develop guidelines and best practices to avoid this.
Enough about 2020. Let’s talk about 2021 instead. What do you see as the biggest trend driving innovation in data science next year?
I believe we’ll see a wave of innovation using the new NLP/NLU models I mentioned earlier, which will extend the frontiers of what we can accomplish with AI, especially in use cases related to human-computer interaction. In particular, I think we’ll see leaps in machines’ ability to deal with unstructured inputs, process them better, and handle more open-ended questions, which will lead to better AI-assisted content generation. Similar trends will also continue in computer vision models. Ideas that were revolutionary last year have matured enough to be implemented commercially.
Reinforcement learning can be a big driving factor moving forward, and I think that highlights another important trend. One of the things that’s been holding back reinforcement learning is the price of computational power. I think as cloud computing becomes more accessible, we’ll see reinforcement learning become a more common approach to solving a variety of use cases and ML-related problems. Obviously, there are still some challenges in this area, but I think 2021 will be a big year for reinforcement-learning-based models in the industry, and they’re going to be employed in a lot of really interesting ways.
Do you think any industries are ready to dive into the deep end in ML this year?
Industry predictions are not really my area of expertise, but if I had to answer, I think the medical/pharmaceutical and service sectors are poised to make a splash. Previously, I feel like one of the largest roadblocks for service providers was the relative newness of the field (when it comes to business). This translates into a few issues. First, it means that having the right infrastructure in place can be expensive, from the technology to the actual team members you need to succeed. Second, there is the fact that it’s hard to make it scalable and feasible. However, I think 2021 will see the industry evolve, leading to mature tools and platforms that can provide all these services and infrastructure (like Explorium!).
I think it will really come down to the openness to adopt AI by traditionally risk-averse industries, a particular challenge in the health sector. One of the silver linings of 2020 is that the health sector really saw the benefits of AI and data science, particularly in drug research, improving diagnosis capabilities, and improving patient service. I really hope that healthcare embracing AI leads other traditionally slow-moving industries to adopt data science and AI.
Any other thoughts before we go?
I have a feeling that 2021 will be a big year for the industry, and I’m excited to see what comes next!