Nechama Brodie is a South African journalist and researcher. She is the author of six books, including two critically acclaimed urban histories of Johannesburg and Cape Town. She works as the head of training and research at TRI Facts, part of independent fact-checking organisation Africa Check, and is completing a PhD in data methodology and media studies at the University of the Witwatersrand.
Even great programmers will tell you how tricky it can be to find a problem buried in lines and lines and lines of code. But as students, users and consumers of 'data' in the bigger, broader sense, the problems we encounter sometimes exist, or are introduced, before the data even enters the 'machine'. And this is one of the biggest challenges facing the development of computer-based problem solving right now: the quality of the data we are using in our programmes.
Algorithms are, simply, a set of instructions (given to a 'computer') in a specific language. Machine learning (ML) takes this an exponential step further: as part of the algorithm, the 'computer' is instructed to learn from itself and its data. To enable this process, machines need starting data to 'learn' from. Typically, the training data are sets of information that have already been examined, filtered and checked by humans. Computers use these to learn how to process and filter different inputs, and what counts as a 'correct' output, and they are then able to replicate those outputs at a later stage, on new data. Think of those annoying 'captcha' screens you encounter when you have to verify you are not a robot while logging in to a website: those are actually training data, and you are providing input while verifying yourself!
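To make that concrete, here is a minimal sketch (not from the article; the tiny 'spam' dataset and its labels are invented purely for illustration) of how a model learns from human-labelled training data, in Python with scikit-learn:

```python
# A minimal sketch of learning from human-labelled training data.
# The tiny 'spam' dataset below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: examples a human has already examined, filtered and labelled.
messages = [
    "win a free prize now",
    "limited offer, claim your reward",
    "meeting moved to 3pm",
    "can you review the draft report?",
]
labels = ["spam", "spam", "not spam", "not spam"]  # the 'correct' outputs

# The machine learns which word patterns map to which label ...
vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(messages), labels)

# ... and then replicates that judgement on new, unseen data.
print(model.predict(vectorizer.transform(["claim your free reward now"])))
# -> ['spam']
```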
We know bad data produces bad results. But what if the training data that gets put into an ML programme is flawed, or dirty, or over-cleaned, or biased? Well, as this article in the Harvard Business Review correctly argues, it compromises everything the 'machine' learns, and because of the speed and frequency with which some programmes perform and repeat their functions, even small errors can cascade into large deviations: a bad-data snowball effect.
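A rough sketch of that snowball effect (again my own illustration, not from the HBR piece): train the same classifier on clean labels and on labels where a growing fraction have been flipped at random, then test both on new data. In a typical run, accuracy falls as more of the training labels are corrupted:

```python
# A rough sketch of how noisy training labels degrade what a model learns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a real, human-labelled dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise in [0.0, 0.1, 0.3]:          # fraction of training labels flipped
    y_dirty = y_train.copy()
    flip = rng.random(len(y_dirty)) < noise
    y_dirty[flip] = 1 - y_dirty[flip]  # corrupt the 'correct' outputs

    model = LogisticRegression(max_iter=1000).fit(X_train, y_dirty)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"label noise {noise:.0%}: test accuracy {accuracy:.2f}")
```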
This article clearly explains the problems and challenges with the quality of data being used to train machines, and suggests several simple, effective, and essential solutions.