From the course: Foundations of Responsible AI

Big data and where it comes from

- User-generated content on the internet has created petabytes of data over the last 10 years. Because it's easy to scrape, we use this data to train machine learning models to perform tasks like translating text between languages, assisting doctors with medical diagnoses, and automatically generating lines of code from a simple comment describing a coding task. The problem is that much of this data was never created or collected for the purpose of training ML models, and using data that wasn't created for modeling can introduce various biases. For example, companies often collect user data in order to serve targeted ads that persuade audiences to purchase products. This can be harmless in some cases, but in the context of financial products like credit cards and payday loans, this commonly accepted practice can result in harmful and disproportionate outcomes.

While training machine learning models like neural networks can cost millions of dollars, as in the case of OpenAI's GPT-3, building appropriate datasets is a far more reliable way to produce effective models. The reality, though, is that good quality datasets usually take more time to collect than most teams are willing to spend. They can also be costly to assemble and hard to come by, since few open-source or low-cost comprehensive datasets exist. Good quality datasets are large enough for the computation at hand, representative of the people the model will be used on, and collected responsibly.

Enterprise companies often use the vast amounts of free user-generated text on the internet to train large language models, even though they could probably afford to create their own data. Language models aren't the only models trained on public internet data, but they tend to be some of the most harmful. Computer vision models trained on images scraped from the internet are another example. The problem is that these datasets tend to represent the views and opinions of center-right Americans and carry a heavy focus on Western values.

While many of these projects have come from research organizations or large enterprises, it's crucial that we inspect the motives for collecting data about citizens and users. Some data collection is relatively benign, like the login data for your favorite forum, but plenty of organizations collect data for much broader use with less transparent agendas. Within the surveillance, government, and defense industries, for example, data collected about us is leveraged for objectives like surveilling marginalized communities.

When it comes to data collection, users and organizations don't hold the same power. Large tech companies can collect millions of data points about us without any regulatory requirement to tell us, in terms we can easily understand, what they're collecting and how they use it. We've promoted data-driven decision making to the point that organizations believe the best course of action when developing a new product or application is to collect as many data points as they can, in the hope that the data may prove valuable later. Often the information collected isn't valuable, and teams waste time and resources chasing insights that aren't predictive of what they should be doing. That's an organizational waste, but the process hurts users in many ways as well. By the time many organizations get around to using the vast amounts of data they've collected, it has already grown stale.
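One way to make the staleness problem concrete is to audit a dataset's timestamps before training on it. Below is a minimal sketch, assuming a pandas DataFrame with hypothetical user_id and last_active columns and an arbitrary two-year cutoff; adapt the names and threshold to your own schema.

```python
# Minimal sketch: flag records that have gone stale before training on them.
# Column names ("user_id", "last_active") and the two-year cutoff are
# hypothetical; adjust them to your own schema.
import pandas as pd

STALENESS_CUTOFF = pd.Timestamp.now() - pd.DateOffset(years=2)

def flag_stale_records(df: pd.DataFrame) -> pd.DataFrame:
    """Add a boolean 'is_stale' column marking rows older than the cutoff."""
    df = df.copy()
    df["last_active"] = pd.to_datetime(df["last_active"], errors="coerce")
    # Rows with missing or unparseable timestamps are treated as stale too.
    df["is_stale"] = df["last_active"].isna() | (df["last_active"] < STALENESS_CUTOFF)
    return df

if __name__ == "__main__":
    now = pd.Timestamp.now()
    users = pd.DataFrame({
        "user_id": [1, 2, 3],
        "last_active": [now - pd.DateOffset(days=30), now - pd.DateOffset(years=3), None],
    })
    audited = flag_stale_records(users)
    print(audited)
    print(f"{audited['is_stale'].mean():.0%} of records would be stale at training time")
```

Running an audit like this before every training run is cheap, and it turns "the data feels old" into a number a team can act on.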
Stale data takes many forms: sometimes we find records for users who have canceled their accounts, or records that are no longer associated with a name or email address. Often the data also contains personally identifiable information about users and hasn't gone through any kind of anonymization or other preventative security measures. And we rarely consider that when data is stored or moved, there may be undocumented manipulations happening along the way. This is often how data analysts and data scientists end up with databases or spreadsheets that are missing information or are formatted incorrectly.

This forces us to think hard about whether we should train models on historical data at all. There may have been errors in collecting or moving the data, or sensor and hardware errors, that are impossible for us to completely correct or mitigate. Sensor and hardware errors are a source of measurement bias. This happens when the devices used to measure data, such as blood oxygen monitors, routinely underperform for some groups of people (a simple per-group check is sketched below).

The vast majority of data is flawed in some way, especially user-generated content; YouTube comments aren't a data source we want machine learning models to emulate. When we train models on this kind of data, we create models that reflect the values of those who post most often. The best solution is to collect custom data for the model you're trying to build. Refrain from using data that's conveniently just sitting around, that happens to be more than a few years old, or that wasn't collected for the purpose of training AI.
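To illustrate the measurement-bias check mentioned above, here is a minimal sketch that compares a device's readings against a trusted reference measurement, broken down by group. The column names (device_spo2, reference_spo2, group) and the toy numbers are hypothetical; in practice the reference would come from a gold-standard measurement, such as an arterial blood gas test in the blood oxygen example.

```python
# Minimal sketch of a measurement-bias check: compare device readings against
# a trusted reference measurement, broken down by group. All column names and
# values here are hypothetical placeholders.
import pandas as pd

def per_group_error(df: pd.DataFrame) -> pd.DataFrame:
    """Mean absolute error of the device vs. the reference, per group."""
    df = df.copy()
    df["abs_error"] = (df["device_spo2"] - df["reference_spo2"]).abs()
    return df.groupby("group")["abs_error"].agg(["mean", "count"])

if __name__ == "__main__":
    readings = pd.DataFrame({
        "group":          ["A", "A", "A", "B", "B", "B"],
        "device_spo2":    [97, 95, 96, 98, 97, 99],
        "reference_spo2": [96, 95, 97, 94, 92, 95],
    })
    # A large gap in mean error between groups is a red flag that the device,
    # and any model trained on its output, underperforms for one group.
    print(per_group_error(readings))
```

A consistently larger error for one group is exactly the kind of flaw that no amount of downstream modeling can correct; it has to be caught at the data-collection stage.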
