This is a talk by Veronica Mates, a technical program manager for human computation and measurement at Pinterest, a role that covers all of the company's human evaluation and crowdsourcing efforts. The talk compares humans and machines, but more specifically it is about how Pinterest has used Amazon Mechanical Turk to improve the content it provides to pinners.
At 0:42, Veronica Mates states that, like most internet companies today, Pinterest has a wide range of content types. The key unit of content at Pinterest is the pin, and the core of every pin is an image.
The image in a pin contains objects, and those objects need labels. The pin is also related to different search queries, topics, and annotations. Before a machine is smart enough to understand the connections between all of these pieces of content, a human being has to sit down and create those connections. That is what the team at Pinterest calls human eval.
Multiple Micro Tasks to solve larger problems
At 1:23, Veronica Mates states that human eval is the process of having multiple people perform micro-tasks to solve a larger problem. The micro-tasks tend to be labeling tasks, and the larger problem tends to fall into one of four categories in which Pinterest relies on human evaluation. We will discuss each one and give an example of how Pinterest has used human evaluation in this way to improve the product.
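As a rough illustration of how many small labeling judgments can be combined into one answer per item, here is a minimal sketch that uses majority voting. The talk does not say how Pinterest aggregates labels, so the majority rule, the function name `aggregate_labels`, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Combine per-worker labels into one label per item by majority vote.

    judgments: iterable of (item_id, label) pairs, one per worker judgment.
    Returns {item_id: (winning_label, agreement)} where agreement is the
    share of judgments for that item that match the winning label.
    Ties are resolved in favor of the label seen first (an arbitrary choice).
    """
    labels_by_item = {}
    for item_id, label in judgments:
        labels_by_item.setdefault(item_id, []).append(label)

    aggregated = {}
    for item_id, labels in labels_by_item.items():
        winning_label, votes = Counter(labels).most_common(1)[0]
        aggregated[item_id] = (winning_label, votes / len(labels))
    return aggregated


# Hypothetical judgments: three workers label pin1, one worker labels pin2.
sample = [
    ("pin1", "relevant"),
    ("pin1", "relevant"),
    ("pin1", "not_relevant"),
    ("pin2", "not_relevant"),
]
result = aggregate_labels(sample)
```

The agreement value gives a simple way to spot items where workers disagree and a further review might be worthwhile.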
Machine Learning Data
Pinterest uses this category in many different areas, but the example here is Pinterest ads, which the company calls promoted pins. Pinterest is a site about inspiration and discovery. People visit it to find things they might not know exist but that they want to act on in their real lives. For example, they come looking for dinner recipes and find a recipe pin that they really like.
They take the ingredients from the pin, walk down to the farmers' market, buy the ingredients, and make that meal for dinner.
Or they see a pin of a living room scene, are inspired by its atmosphere, and want to recreate it in their own living room. They want to buy the same couch and the same coffee table, and even paint their walls the same color as the walls in the pin. At 2:57, Veronica Mates states that Pinterest ads can perform just as well as organic content, provided the ads are highly relevant to the inspiring moment. So earlier this year, the Pinterest team wanted to better understand the relevancy of the ads it was showing to pinners, and it used human eval to rate the text annotation relevance of over 700,000 of its promoted pins.
They used the collected data to improve the precision of their text signal by 16%, which in the end produced an increase of between 20% and 70% in relevancy across different types of pins and different placements of those promoted pins. These numbers were so impactful for the team that it is now tracking a human eval metric for one of its key goals.
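To make the notion of precision concrete, the following sketch measures how often a text signal's "relevant" predictions agree with human ratings. The data layout, the function name `precision`, and the sample values are hypothetical; the talk does not describe how Pinterest computed its figure.

```python
def precision(predicted_relevant, human_relevant):
    """Share of pins the signal marks relevant that humans also rated relevant.

    predicted_relevant: {pin_id: bool} output of the text signal (assumed format).
    human_relevant:     {pin_id: bool} human eval ratings (assumed format).
    Only pins that have a human rating are counted.
    """
    flagged = [
        pin_id
        for pin_id, is_relevant in predicted_relevant.items()
        if is_relevant and pin_id in human_relevant
    ]
    if not flagged:
        return 0.0
    confirmed = sum(1 for pin_id in flagged if human_relevant[pin_id])
    return confirmed / len(flagged)


# Made-up example: the signal flags three pins, humans confirm two of them.
signal = {"a": True, "b": True, "c": True, "d": False}
ratings = {"a": True, "b": False, "c": True, "d": True}
score = precision(signal, ratings)
```

Comparing this value before and after a change to the signal is one way to express an improvement in precision.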
At 3:45, Veronica Mates states that the second use case is labeling content for general content curation, whether for marketing events or engineering services. The example she gave was content curation for their dictionary service for explore pages.
Every day, people come to Pinterest and type in search queries that Pinterest may never have seen before, and the team needs to decide whether to create an explore page for each query. An explore page is like a topic page that renders the best content Pinterest has related to that query. Pinterest has a strict content policy, however, and does not want to create explore pages that breach it.
So the team runs all of the queries through filtering and expansion, and also through human evaluation, where graders mark each query as appropriate or inappropriate under the policy. For appropriate queries, the team creates explore pages; inappropriate queries go onto a blacklist.
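The routing step described above can be sketched as follows. The decision rule (a strict majority of graders must say "appropriate", with ties and ungraded queries sent to the blacklist) is an assumption chosen to be conservative, and `route_queries` is a hypothetical name rather than anything from Pinterest's systems.

```python
def route_queries(grades):
    """Split queries into explore-page candidates and a blacklist.

    grades: {query: [label, ...]} where each label is "appropriate" or
    "inappropriate", one per grader (assumed format).
    A query needs a strict majority of "appropriate" labels to get an
    explore page; ties and queries with no labels go to the blacklist.
    """
    explore_pages, blacklist = [], []
    for query, labels in grades.items():
        approvals = sum(1 for label in labels if label == "appropriate")
        if approvals * 2 > len(labels):
            explore_pages.append(query)
        else:
            blacklist.append(query)
    return explore_pages, blacklist


# Made-up grades for three queries.
pages, blocked = route_queries({
    "dinner recipes": ["appropriate", "appropriate", "appropriate"],
    "query x": ["inappropriate", "inappropriate", "appropriate"],
    "query y": ["appropriate", "inappropriate"],
})
```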
Time series measurement
At 4:58, Veronica Mates states that the third major use case for human evaluation at Pinterest is time series measurement. Search is an important part of a product like Pinterest: it passed 2 billion idea searches a month earlier this year, and the team wants that number to grow. The only fair way to grow it is to make sure that the relevancy of search results improves over time.
Relevancy improvement can be checked by tracking it against a baseline relevancy metric. For this purpose, the Pinterest team takes the top 20 pins for a sample of head and tail queries and runs them through human evaluation, which asks whether each pin is relevant to its query. Aggregating across the highest-performing and lowest-performing queries gives a general relevancy score for that week, which the team tracks every week. If the relevancy decreases, something has gone wrong and the team should dig in and figure out what happened.
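A weekly score of this kind could be computed roughly as below. This is a sketch under stated assumptions: each query contributes the share of its judged pins rated relevant, queries are weighted equally, and a week is flagged when it falls more than a chosen tolerance below the baseline. The function names, the tolerance, and the sample numbers are illustrative and not taken from the talk.

```python
from statistics import mean


def weekly_relevancy(judgments):
    """Average, over sampled queries, of the share of judged pins rated relevant.

    judgments: {query: [bool, ...]} with one relevance rating per top pin
    (assumed format; every query must have at least one rating).
    """
    return mean(sum(ratings) / len(ratings) for ratings in judgments.values())


def flag_drops(weekly_scores, baseline, tolerance=0.02):
    """Return the weeks whose score is more than `tolerance` below the baseline."""
    return [week for week, score in weekly_scores if score < baseline - tolerance]


# Made-up data: one head query and one tail query, four judged pins each.
score = weekly_relevancy({
    "head query": [True, True, True, True],
    "tail query": [True, True, False, False],
})
suspect_weeks = flag_drops(
    [("week 1", 0.80), ("week 2", 0.81), ("week 3", 0.70)], baseline=0.80
)
```

Any flagged week is a prompt to investigate what changed, in line with the tracking described above.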
At 6:32, Veronica Mates states that the fourth major use case for human evaluation at Pinterest is A/B experimentation. Pinterest is fond of running experiments on both front-end UI changes and back-end algorithm changes. Every time the team has an experience it wants to test, it launches it to a very small set of users, checks over some period of time that none of its metrics are negatively impacted, and then increases the number of users who are triggered into that experience. Sometimes, however, these experiences can be controversial.
The team risks putting users into a really negative experience that leaves a lasting impression, so it introduced offline human evaluation for certain types of A/B experiments. Taking a search ranking algorithm as an example, the team can now take all of the queries that had the highest churn in pin results because of the experiment under test and render the treatment and control experiences side by side.
They can then ask a grader which experience is more relevant to the query and, in the end, aggregate the overall relevancy of each experience to decide whether the change will have a positive impact on the end-user experience. They can do this without affecting actual users. In 2017, Pinterest significantly increased its usage of human evaluation, and it is a company that relies heavily on human evaluation compared with machine evaluation.
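One way to aggregate side-by-side preferences is sketched below. The talk does not say how the votes were combined or tested, so the vote format, the handling of "same" votes, and the two-sided sign test are assumptions added for illustration; `compare_side_by_side` is a hypothetical name.

```python
from math import comb


def compare_side_by_side(votes):
    """Summarize grader preferences between treatment and control.

    votes: list of "treatment", "control", or "same" (assumed format).
    Returns (treatment_win_rate, p_value). The win rate ignores "same"
    votes; the p-value is a two-sided sign test against a 50/50 split.
    Returns (None, 1.0) when no grader expressed a preference.
    """
    treatment = votes.count("treatment")
    control = votes.count("control")
    decisive = treatment + control
    if decisive == 0:
        return None, 1.0

    win_rate = treatment / decisive
    larger = max(treatment, control)
    # Probability of a split at least this lopsided if graders had no preference.
    tail = sum(comb(decisive, k) for k in range(larger, decisive + 1)) / 2 ** decisive
    return win_rate, min(1.0, 2 * tail)


# Made-up votes: 8 prefer treatment, 2 prefer control, 3 see no difference.
rate, p_value = compare_side_by_side(
    ["treatment"] * 8 + ["control"] * 2 + ["same"] * 3
)
```

A high win rate with a small p-value would suggest graders find the treatment more relevant; with only a handful of votes, the result is usually too noisy to act on.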
In 2017, Pinterest decided to focus on quality analysis, starting with the two platforms it used for its most common English-language content. One was Mechanical Turk, which at that point handled about 5% of its human evaluation work; the second, referred to here as platform 2, handled about 70%.
The team then compared the accuracy and blacklist rate of the two platforms and found that Mechanical Turk did better than platform 2, so it started using Mechanical Turk for more than 5% of its work. Back in mid-2016, Pinterest had started using a self-serve human evaluation platform called Sofia, which allows customers to come in and build reusable templates. In 2017, its use of Mechanical Turk increased significantly, and it stopped using platform 2 for English-language content.
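A comparison like this could be run per platform with a small set of questions whose correct answers are already known. The sketch below is only one possible reading: it assumes accuracy is measured against such gold answers and that "blacklist rate" means the share of workers whose accuracy falls below a threshold. Neither definition is confirmed by the talk, and `platform_quality` and the 0.7 threshold are hypothetical.

```python
def platform_quality(responses, gold, min_accuracy=0.7):
    """Estimate a platform's accuracy and blacklist rate from gold questions.

    responses: list of (worker_id, question_id, answer) tuples (assumed format).
    gold:      {question_id: correct_answer}; other questions are ignored.
    Returns (overall_accuracy, blacklist_rate), where blacklist_rate is the
    share of workers scoring below `min_accuracy` on gold questions.
    """
    per_worker = {}  # worker_id -> (correct_count, answered_count)
    for worker_id, question_id, answer in responses:
        if question_id not in gold:
            continue
        correct, answered = per_worker.get(worker_id, (0, 0))
        per_worker[worker_id] = (correct + (answer == gold[question_id]), answered + 1)

    if not per_worker:
        return 0.0, 0.0

    total_correct = sum(correct for correct, _ in per_worker.values())
    total_answered = sum(answered for _, answered in per_worker.values())
    blacklisted = sum(
        1 for correct, answered in per_worker.values() if correct / answered < min_accuracy
    )
    return total_correct / total_answered, blacklisted / len(per_worker)


# Made-up responses from two workers on two gold questions.
accuracy, blacklist_rate = platform_quality(
    [("w1", "q1", "yes"), ("w1", "q2", "no"), ("w2", "q1", "no"), ("w2", "q2", "no")],
    gold={"q1": "yes", "q2": "no"},
)
```

Running the same function on each platform's responses to the same gold questions gives two pairs of numbers that can be compared directly.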
While building Sofia, Pinterest had four key goals that it tracked along the way.
- The first was a simple self-serve UI.
- The second was template functionality.
- The third was result quality.
- The fourth was reducing costs.
In a nutshell, Pinterest started using Amazon Mechanical Turk to increase quality, reduce cost, improve the user experience, and build the most functional templates. In Mechanical Turk terminology, one Sofia job creates multiple HITs that a worker can open at the same time, and this caused some quality and cost problems for the team. Avoiding these issues was part of how Pinterest used Mechanical Turk to increase quality and reduce costs.