Archives – RIPE 80

Plenary session
At 3pm:

CHAIR: Hello everyone. I hope you are back. My name is Alexander Azimov and, together with Franziska, I will be chairing the next session. During this, we will have two talks. The first one will show the way to detect scam web shops and the second will show experience with delegated RPKI.

And I want to remind you that you can ask questions using the Q&A button and, if you can, please keep them short.

And now I am going to invite the first speaker, Thymen Wabeke, who will tell us about detecting and taking down counterfeit web shops.

THYMEN WABEKE: I hope you can all see the slides, hear my voice and see my face. Today I want to tell you something about our academic paper in which we took down quite a number of counterfeit web shops and I would like to start with an example of such a web shop straight away.

So, here is an example of such a web shop, it's called nederlandwebshop.nl. It's a well‑designed web shop. There are some popular brands on the page. The domain name itself looks trustworthy and there are some good discounts here. So the question is, would you buy something on this website? I think most of us, if we are looking for some clothes, we will buy on this web page. It looks trustworthy. However I have to tell you this web page is a fake one. If you buy something here, nothing will be delivered or a fake lower quality version of the original product will be delivered.

So, people are being scammed. And, of course, if nothing will be delivered it's a clear story, it's a clear scam. But also, if a fake version is being delivered, we think that it's a scam. And that's because it's about expectation. So, suppose you're on a street market on holiday somewhere in the future and you want to buy a pair of sunglasses, on the street market you can have sort of an expectation that it might be a fake product.

However, estimating whether a product is fake is really difficult when you are buying online in a web shop. You only have a small picture and it's very difficult to assess.

So, as a result, people are being scammed and they experience losses. These losses, they have been reported in the news, so here on the right side of the slide, you can see a screenshot of nos.nl, which is sort of the Dutch BBC, and here is a headline stating that consumers have been scammed for around 5 million euro because they bought something on fake web shops.

And as a result of those losses, the trust in the Internet may decrease, and in the end we are the registry of the .nl DNS zone and therefore we have a high interest in the trust of consumers and Internet users. We don't want their trust in .nl or the broader Internet to decrease.

So, that's our motivation as SIDN, but we also have the ability to do something here.

We are the registry, and, because of being a registry, we have the zone file or the list of all .nl domain names, and we not only have this list, we also have data related to those domain names. For example, we have the registration data. But we also have some active and some passive measurements. We, for example, have a crawler called DMAP that crawls the .nl zone on a regular basis and we collect, for example, data about the HTTP website but also about the TLS infrastructure and the mail services that are being used.

And in this research project we use that data to find patterns and to detect counterfeit web shops. So what are the results so far?

We started with this project a couple of years ago in 2016 and, since then, we have detected and taken down thousands of suspicious domain names, and, by doing so, we have ultimately, of course, protected users from being scammed.

In this talk, I will focus on the academic work that we performed and this work is centred around two case studies. One case study used the brand counter detector and we did this case study at the beginning of 2018. And the other case study is centred around FADE and that case study took place at the beginning of 2019.

Of course, if you start with an academic work, you have to come up with research questions. You can see the first one is related to the scale of the problem, how many of such fake web shops are there within .nl zone?

The second question is the actions, the things that we can do about those domain names, how can we actually take them off line?

And the third question is more of a reflective question: Given that we detected some counterfeit web shops, what can we learn from those web pages, from those domain names? Are there any patterns that tell us something about how the counterfeiters operate?

The first two questions I will address twice, so once for brand counter and once for FADE, the FADE detector, and the third research question I will address at the end of the presentation.

So let's start with brand counter. Like I told you, we stumbled upon strange web pages, strange web shops in 2016. During that time we were doing some other research related to phishing domain names and we saw quite some web shops that sell Nike shoes, clothes and other luxury goods, at very high discounts, and we observed something. Those web pages had very long page titles with all those brand names. So that gave us an idea. What if we just draft a list of brand‑related words and we crawl .nl domain names and count all the terms that are in those titles. And if the number of suspicious terms in the title is above a threshold, then we mark that domain name as being suspicious.

So this is a very simple method and we designed this method just to see whether there are maybe 10, maybe 100, maybe 1,000 .nl domain names that are perhaps related to fake goods being sold.
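A minimal sketch of the title-counting idea described above. The brand-term list, the threshold value and the example domains and titles are all hypothetical; the talk only says that terms in the page title are counted and compared against a threshold.

```python
import re

# Hypothetical brand-related terms; the list used in the study is not given in the talk.
BRAND_TERMS = {"nike", "adidas", "ugg", "outlet", "sale"}
# Hypothetical threshold; the talk only says "above a threshold".
THRESHOLD = 3


def count_brand_terms(title, terms=BRAND_TERMS):
    """Count how many words in a page title appear on the brand-term list."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return sum(1 for word in words if word in terms)


def is_suspicious(title, threshold=THRESHOLD):
    """Mark a title as suspicious when its brand-term count is above the threshold."""
    return count_brand_terms(title) > threshold


# Hypothetical crawl results: domain name -> page title.
titles = {
    "example-shoes.nl": "Nike Outlet Sale | Adidas Sale | UGG Outlet | Goedkope Nike Schoenen",
    "example-bakery.nl": "Bakkerij Jansen - vers brood uit Utrecht",
}
flagged = [domain for domain, title in titles.items() if is_suspicious(title)]
print(flagged)
```

Repeated terms are counted each time they occur, which matches the observation that the suspicious titles were very long and stuffed with brand names; counting distinct terms instead would be an equally plausible reading.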

The method is simple, but it worked really well, and you can see that on this slide. What you see here on the X axis are different scans that we performed, different crawls that we performed on .nl zone and on the Y axis you see the number of suspicious domain names that we detected via brand counter.

The purple bars indicate the number of suspicious domain names that we detected and you can see that the first time we detected over 12,000 suspicious domain names. That's really a lot. Then the second time you can see with the green area that we detected 4,000 new suspicious domain names, and you also see in this graph that we performed a couple of notifications, and I want to say something about those notifications.

Because we detected quite some suspicious domain names and we asked the abuse department to manually investigate all those domain names. And then we had a list of around 4,000 true positives, domain names that were indeed, according to our abuse experts, fake web shops. But we ourselves, as being a registry, have limited possibilities to do something about those domain names. And that's why we decided to partner with a registrar, in this study we partnered with registrar A, because this registrar alone was responsible for over 40% of the fake web shops. And the results of this notification study you can see it on the right side of the slide, so, the first notification in January 2018 was the largest one. Here, we notified over ‑‑ or almost 3,500 domain names and the green area ‑‑ or, sorry, the blue area indicates the successful notifications and the orange ones were not successful.

In total, during this notification study, we took down over 3,700 domain names, which is 90% of the total set of fake web shops. So that's a great success. But to be honest, it was not the end of the project.

As you can see on this graph, you see that the number of web shops that we detected decreases quite rapidly after a notification. And that can, of course, tell you different things. So one of the things that could happen here is that the counterfeiters have just given up. But the counterfeit industry is a huge industry, and they make a lot of money by scamming people. So we think something else was happening here.

We think the counterfeiters somehow learned to avoid our brand counter detector. And so, to see whether that was actually the case, we decided to come up with a new detector and this new detector is called FADE detector, and there are two requirements for this detection system. The first requirement was that it should not be dependent on the page titles, because that was the main assumption for brand counter and we would like to investigate whether we could find more domain names with a different approach.

The other requirement was that it should not be biased too much towards our own perspective, because if you are dealing with a problem for a couple of years, somehow, by nature, you will get some bias.

So, to come up with a solution, we decided to collaborate with ICS, International Card Services, who is a major credit card provider in the Netherlands, and this organisation they provided us with a list of true positives, so counterfeit web shops that were actually involved in scams; there were 231 domain names. We used a supervised machine learning technique to train a classification model, and the goal of this classification model was to discriminate between suspicious and legitimate web shops.

So, how did this process of generating a machine learning model work? The process I plotted on the slides, so we start on the left side, of course you need to have a dataset, ground truth examples. In our example we started with the data that was provided by ICS cards, so the 231 counterfeit web shops, and we also randomly sampled an equal number of legitimate web shops which we manually verified. Then we crafted a couple of features. We used 9 features here, so 6 are related to the registration itself. So, for example, we used the registrar that was used by the registrant. But we also used the e‑mail provider that the registrant uses and also the hour of the day at which the domain name was being registered.

The infrastructure related features were, for example, whether TLS was used on the web page or whether an MX (mail server) record was set, and we also looked at the AS that was used for the web server.

Then we split the dataset into training and test samples. 80% of the data, we used that for training and we used a support vector machine for training. A support vector machine is, well, sort of a traditional classical machine learning algorithm, but it's very powerful, it's not too complex, you don't have the risk of, for example, over‑fitting. An SVM has a couple of settings that you can tune; we used a search to tune those settings.
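The data preparation described above can be sketched roughly as follows, using only the Python standard library. The field names, sample values and the one-hot encoding are assumptions made for illustration; the talk names some of the features but not how they were encoded. Training the SVM itself is not shown here, since it would depend on a machine learning library that the talk does not name.

```python
import random

# Hypothetical labelled samples: field names and values are illustrative only.
# label 1 = counterfeit web shop, label 0 = legitimate web shop.
samples = [
    {"registrar": "registrar-a", "email_provider": "163.com", "reg_hour": 3,
     "has_tls": False, "has_mx": False, "label": 1},
    {"registrar": "registrar-b", "email_provider": "example.nl", "reg_hour": 14,
     "has_tls": True, "has_mx": True, "label": 0},
] * 5  # repeated to get 10 samples for the split below


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def to_vector(sample, registrars, providers):
    """Turn one registration record into a numeric feature vector."""
    return (
        one_hot(sample["registrar"], registrars)
        + one_hot(sample["email_provider"], providers)
        + [sample["reg_hour"] / 23.0, float(sample["has_tls"]), float(sample["has_mx"])]
    )


def split_80_20(items, seed=0):
    """Shuffle with a fixed seed, then keep 80% for training and 20% for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


registrars = sorted({s["registrar"] for s in samples})
providers = sorted({s["email_provider"] for s in samples})
train, test = split_80_20(samples)
X_train = [to_vector(s, registrars, providers) for s in train]
y_train = [s["label"] for s in train]
print(len(train), len(test), len(X_train[0]))
```

The resulting `X_train` and `y_train` lists are in the shape that most SVM implementations accept, and the held-out `test` samples are only touched once, for the final evaluation.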

After some experimenting, we were quite positive about our model and then we used the 20% testing samples to evaluate our model. And you can see the results of this evaluation on the bottom of the slides. So we look at precision, which means how many of the domain names that you detect are actually fake. So, the true positive rate. And the recall tells you something about whether you do detect all true positives, or fake web shops in our case. Both metrics as you can see in the table, they are pretty good.
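As a small illustration of the two metrics, the sketch below computes precision and recall from true and predicted labels. The label lists are made up for the example and are not the study's results.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'fake web shop'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: share of flagged domains that really are fake.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of fake domains that were actually flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation data: 4 fake shops, 4 legitimate shops.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this toy data one legitimate shop is wrongly flagged and one fake shop is missed, so both metrics come out at 0.75.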

So that's why we decided to apply the machine learning model that we developed to the .nl zone. But we didn't apply it to the full zone, we decided to only look at a subset of 30K e‑commerce domain names, because we didn't want to stress out the people at ICS, the credit card issuer, and our own abuse desk with too many false positives because they had to manually verify each domain name. So that's why we start with a subset.

So out of those 30K .nl domain names, around 1,400 were classified as being suspicious, and those were annotated by the people at ICS and SIDN and you can see the result of that on the bar chart at the right side. So the majority of the domains were indeed true positives. They were fake web shops. But there were also some false positives and some of the domain names were not reachable at the time of annotation, so that's for example, because the people already cancelled the domain name themselves or something else happened here.

And the true positives, they ‑‑ we sent notification for those domain names to a couple of registrars, so here it's different from the first case study. Because then we only contacted registrar A, but here multiple registrars are involved. But even though we had to contact multiple registrars, we were able to take down most of the domain names.

In total, we took down 774 of the 894 true positives, which gives us a rate of about 84%, which is really good.

So these are the two case studies and we were able to detect and take down thousands of domain names. But then we were wondering, could we learn something by taking a closer look at the domain names themselves? Or in other words, how do the counterfeiters operate?

Well, the first thing that we observed is that there were thousands of domain names. So it seems like there is sort of a production farm of fake web shops. And if you want to register a lot of domain names, and I think we all would do that, then you generate a script to do that for you. And that's also what we see in the data. We see a lot of hints that point towards automation. So, for example, we see that registrars that offer APIs are popular, and of course cheap registrars are also very popular. What we also see is that the people behind those shops, they tend to use the same template for their websites but they often change a couple of things. So, for example, the colour, a few words, brands, these kinds of things.

What we also see is that most of the domain names were domain names that were registered in the past. So, these are drop catch domain names. You can see that also in the plot on the right side. What you see here is the days in between the domain name expiration, so the date that the domain name became available again, and the moment of reregistration, and you can see that 60% of the domain names are registered at that date. So there are no days in between.

About 90% is registered within five days. So these are all hints towards automation.
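The drop-catch measurement above boils down to the number of days between a domain name becoming available and its re-registration. A minimal sketch, with made-up dates rather than the study's data:

```python
from datetime import date

# Hypothetical (expiration date, re-registration date) pairs.
pairs = [
    (date(2018, 3, 1), date(2018, 3, 1)),
    (date(2018, 3, 1), date(2018, 3, 1)),
    (date(2018, 3, 2), date(2018, 3, 2)),
    (date(2018, 3, 5), date(2018, 3, 8)),
    (date(2018, 3, 5), date(2018, 4, 20)),
]


def share_within(pairs, max_days):
    """Share of domains re-registered at most max_days after they became available."""
    delays = [(reregistered - expired).days for expired, reregistered in pairs]
    return sum(1 for delay in delays if delay <= max_days) / len(delays)


same_day = share_within(pairs, 0)
within_five = share_within(pairs, 5)
print(same_day, within_five)
```

A high same-day share is hard to explain with manual registrations, which is why it is read as a hint towards automation.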

Then the next question is of course, why would you automate things? Why would you register thousands of domain names? You could also invest in one or two domain names and really pay attention to those. But that's not what they do.

We think they don't do that because if you register thousands, it becomes more difficult to detect them all. Like, for example, brand protection agencies or government agencies, they can perhaps find a few fake web shops via Google or other ways. But it's very hard to detect them all.

So the fact that domain names are cheap and that they are disposable, we can also see that in the data. So, for example, we see that the domain names do not match the content. That's probably also due to the automation. So, for example, we have web shops selling shoes with a domain name that's from a bakery or something, so that's a strange thing. We also see a lot of spelling mistakes and translation errors on the web pages. And what we also observe is that most of the fake web shops, the domains that are involved there, they have really short lifetimes. So you can also see that on the plot on the right side. Here you can see when the domain names are being cancelled and for .nl the subscription period is a year, so you can cancel your domain name after a year usually, and what you see is that most of the domain names, they are not renewed after a year, so that's 80%. And 90% is not renewed after two years. So this is different from what we see with regular web shops. Usually people really take care and invest in those domain names.

So then another thing that we learned or that we observed is that most of the registrations, they seem to come from China. On this plot you see on the Y axis the number of suspicious domain names and the bars highlight different e‑mail providers that are being used by registrants. So we see, for example, that 163.com and sina.com are very popular within the subset of fake web shops.

However, these e‑mail providers, they are not very popular in the Netherlands; we don't see that in the full zone that much. So this hints towards China because those providers, they are very popular within China.

Another hint towards registration from China is the hour of registration. So, here you can see the number of suspicious domain names plotted with the hour of registration and on the top you see the hour in the Amsterdam time zone, which is what you would expect to use for .nl domain names. Most domain names are being registered during the night in Amsterdam. However, if you take a closer look at the hour of registration in Beijing, China, then you see that the hour of registration sort of matches the working hours. And if you look closely, then you can see also a dip, perhaps the lunch break or late coffee break around 11 in the morning.
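The time zone comparison can be sketched as a simple re-bucketing of registration timestamps. The timestamps below are hypothetical and assumed to be stored in UTC; Beijing time is modelled as a fixed UTC+8 offset, since China does not use daylight saving time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# China uses UTC+8 all year round, so a fixed offset is sufficient here.
BEIJING = timezone(timedelta(hours=8))

# Hypothetical registration timestamps, assumed to be recorded in UTC.
registrations_utc = [
    datetime(2019, 1, 10, 1, 15, tzinfo=timezone.utc),
    datetime(2019, 1, 10, 2, 40, tzinfo=timezone.utc),
    datetime(2019, 1, 10, 6, 5, tzinfo=timezone.utc),
    datetime(2019, 1, 11, 1, 50, tzinfo=timezone.utc),
]


def hour_histogram(timestamps, tz):
    """Count registrations per local hour of day in the given time zone."""
    return Counter(ts.astimezone(tz).hour for ts in timestamps)


beijing_hours = hour_histogram(registrations_utc, BEIJING)
print(sorted(beijing_hours.items()))
```

For the Amsterdam view a fixed offset would not be enough because of daylight saving time; a zone-aware object such as `zoneinfo.ZoneInfo("Europe/Amsterdam")` from the standard library could be passed to the same function instead.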

But not everything hints to China. If we take a look at the ASs that are used for hosting, we see different things. Well, we see ASs that belong to many different countries; for example, in the Netherlands but also in Turkey or in the United States.

So, let's summarise some things:

The main thing in this research is that we helped to take down over 4k counterfeit web shops and I think that's a great thing because we really had impact in the world, and for a researcher that's, of course, very cool.

And we also learned a couple of things. One of the things that we learned is that it's really important to collaborate. We, as being a registry, could, of course, detect thousands of domain names, but without partnering with registrars and ICS it was really difficult to take them down.

Another thing is that our detection systems, they are very simple, but on the other hand, they are also very effective. And this shows that the counterfeiters perhaps didn't perceive that much pressure before we started this project.

Another thing that we learned is that, for this type of abuse, if it's really like a scattergun approach where you have thousands of domain names, registries have a very good vantage point because we can overview the whole zone, crawl things and see patterns there.

Another thing is that although the counterfeiters perhaps didn't perceive that much pressure before we started the project, they will perceive pressure now and it will be like a whack‑a‑mole game that will probably continue forever. So that's why we constantly monitor the features, check why they work and we also evaluate our system and it's not only a thing that we do as researchers, but we also do that together with the support team because they are really good at ‑‑ they have a kind of a machine learning model in their head and they know in their mind which features are perhaps useful for us to implement.

On the right side we also ‑‑ I also added a plot of our current system, so we added a feedback loop to the machine learning model, so that for every notification that we send to the abuse desk, the abuse people will give an annotation, so they will provide feedback and we use that to retrain and improve our model.

So, with that, I would like to end the presentation. And I'm very happy to answer your questions.

(Virtual applause)

CHAIR: We have a couple of questions in the Q&A. The first two come from Carsten Schiefner, and since we have only six ‑‑ we'll see. Maybe we'll take one of them.

The first one is: "Why have you, as a ccTLD registry, stepped into websites content business? What about other unwanted content under .nl TLD then? Thanks."

THYMEN WABEKE: That's a good question, because we started with this topic, we perhaps also want to look at other types of abuse. I think we're very much open to that. But just to make sure that everybody is on the same page, we don't take down the domain names because of content. We ask the registrants ‑‑ the registrars, sorry, we ask them whether they want to delete those domain names or not, so it's really up to them. And some of them make different decisions than others.

CHAIR: Okay. I will take the next question because it is bound to the previous one. Has SIDN as .nl registry already been held responsible for self notice takedown action?

THYMEN WABEKE: No. As far as I know, we're not.

CHAIR: Okay. The next question comes from Antonio Prado: "I am not sure ‑‑ did you discuss this in the CENTR council in order to involve more ccTLD registries?"

THYMEN WABEKE: Yes. CENTR is the association of ccTLDs within Europe and we have very close contact with them, also related to fake web shops, because it's not a problem that is only happening within .nl; other zones experience this problem too. So we do help each other by sharing approaches and ideas. Currently we do not share models or data, perhaps that will happen in the future, but that's, well, a bit more challenging because data can also be sensitive.

CHAIR: Okay. Thank you. Let's move forward. The next question comes from a private contributor:

"Isn't it suspicious that more than 40% of counterfeit domains are registered on the one registrant? Did you try to analyse why fraudsters used this registrator?"

THYMEN WABEKE: Yes, we did, but it's a bit speculative there. What I think is, and it's also what we see, is that most of the registrars that are used are the ones that are cheap and that provide automation, and this particular registrar was also very cheap and had APIs, so we think that might explain why this one was very popular.

CHAIR: And next question comes from Steinar Grotterod: "Any idea why dropped domains are preferred?"

THYMEN WABEKE: I think that's because the counterfeiters are perhaps a bit dumb. They just use automation and perhaps they don't know the Dutch language so they just pick a domain name that becomes free and perhaps they make the assumption that that domain name ‑‑ at least it meant something in the past because it was registered before so it's not a random string. So perhaps that's their way of thinking. I don't know for sure.

CHAIR: The next question from Ben: "Did you find network hosters at IP level that DNS records were consistently set to, or is the use of hosters as dynamic as the use of domains?"

THYMEN WABEKE: That's a good question. We didn't have a close look into that. But if we take a look at the things over time, so the first detection system and the second one, there is a difference but it's not a great difference, if I remember correctly. But it's an interesting question, so I think we should pay some more attention to that in the future.

CHAIR: And the question comes from ‑ from ETH Zurich ‑‑ sorry, I make mistakes in the name:

"I am wondering whether in the future more advanced features about the website content itself could be used? They mention spelling mistakes."

THYMEN WABEKE: Yeah, I think we could use these types of features but there are plenty of others that we consider. So, for example, currently we are also trying to see whether we could somehow map the distance between the domain name and the content that you would expect, and focussing more on the content is also a thing that we could add. However, currently we don't use the HTML content any more because that's a more expensive process from a computational side, if you want to process all the text there, and we don't need it at the moment, because the performance is still good, but I can imagine that in the future we want to do that.

CHAIR: And I think the last question will be from: "I wonder how much the fraud damage was caused by these 4,455 fake web shops? This must be huge. Are there any numbers from the credit card companies on this?"

THYMEN WABEKE: No, there are not, and it's a very interesting question. It also keeps me wondering sometimes because we took down thousands of domain names, but I think we protected a lot of consumers with that. If they only make one or two losses, then it's perhaps not a big deal. So that's ‑‑ yeah, it's really difficult to assess that. Given that the domain names that we started with, most of them were actually reported by consumers to the credit card company, I think that's at least an indication that there are plenty of losses, because not everybody reports to their bank, so, yeah, it's ‑‑ I think the numbers are huge, but if somebody knows how to assess that, I would love to hear your opinion on that.

CHAIR: Okay. I am cutting the line. Thank you very much for these reports. I think our audience have enjoyed it and I enjoyed it too. Thank you very much.

THYMEN WABEKE: Thank you. You're welcome.

FRANZISKA LICHTBLAU: Okay. Thank you. Our next speaker is Cynthia Revstrom and she will talk to us about delegated RPKI.

CYNTHIA REVSTROM: Thank you. Let's see if the screen works.

So. This talk is about delegated RPKI, and sort of some of the differences between RPKI ‑‑ sorry, hosted RPKI and delegated RPKI. So, what is delegated RPKI?

So, well, the main difference between hosted RPKI and delegated RPKI is whether ‑‑ well, sorry, I am a bit nervous. The main difference is whether the RIR or NIR hosts your repository, or if you host it yourself, like on internal infrastructure.

So there are obviously a few practical differences in terms of what the pros and cons are. The main one is that it's a lot easier to do hosted RPKI, obviously. You don't have to manage any of the RPKI infrastructure and it's very secure, at least from what I have seen from like RIPE NCC, and I am not entirely sure on the other ones but I assume they are quite secure as well.

But there are some downsides. Like, you will have a separate interface for each RIR in terms of the web interface and API. And some API features may not be available under certain RIRs.

This was the case with ARIN until like a few ‑‑ a month ago maybe, that they didn't allow listing and deletion of ROAs via API, only addition.

So, the main benefits of delegated RPKI would be that you have more flexibility, and you can use the same API and interface for multiple RIRs, which can simplify things if you are like a large company, that might be like in APNIC, RIPE, whatever, ARIN. So you have one interface for all of them.

But I wanted to learn a bit more about this in practice and not just the theory. So, what did I do? Well, I set up ‑‑ or I had requested some resources from ARIN, a /24, and then I set up my, what would we call it, delegated RPKI with them, and this was pretty much a screenshot that I really enjoyed from the process because they apparently do some of this manually, which is a bit odd for delegation, but ‑‑ yes, so you simply submit ‑‑ so I submitted an XML request to ARIN, and then I got back this response. Well, then, I just put that in my Krill CA, which I will get to a bit later, and I was up and running. And as confirmed by Routinator on Twitter, they saw mine there; it's quite a low resolution and I don't think you can really see it, but yeah, I am in there.

However, it wasn't quite that smooth. So, as I will get to a bit later, this is not quite the case any more, but at the time I did it, ARIN was still a bit weird in terms of the format and they used a proprietary‑ish, but very similar, format for the XML files to what RIPE NCC uses and what Krill generates. It was just a big mess because there wasn't much documentation on this at all. There was some, like, XML schema document in some GitHub repository. That was about it. The formats were pretty much the same, but they were incompatible. And so what I did was just modify it according to how it looked in the schema and eventually it just accepted it, because apparently it did some validation before allowing me to submit the form.

And then once again when I got the response, this was ‑‑ not sure whether this is highlighted, but I ‑‑ sorry, they are a bit out of order, but the responses were a bit different for some reason. Because ARIN ‑‑ I'm not quite sure ‑‑ so they had, for some reason, they had two different base64‑encoded certificates in here, which I am not quite sure why, but, yeah ‑‑ and ‑‑ I only needed the one, so, in the new format, I'm not sure what the other one is. I can't really tell you.

But this is basically the format it uses. And this is sort of my ‑‑ the main part, or the interesting thing. So, after doing this, what were my conclusions? That's mainly: should you implement delegated RPKI? And yeah, maybe. If you're a large multi‑RIR organisation that has lots of resources and especially different ASNs for different reasons, maybe a consumer ASN, a backbone ASN, then it could make sense. Do you plan on automating the RPKI, as in will you actually use the API? Because, if not, it doesn't really help that much. But it can still be nice. But, most importantly, can you afford the resources, as in the server, potentially HSM, etc., to make sure it's stable and secure and set up like for redundancy and other things, because well ‑‑ yeah, even though it's fail safe, you know, you would want it to not go down.

But, should you implement RPKI at all? Well, I think you have heard this from quite a lot of people. Yes! Yes, you absolutely should implement RPKI.

However, once again, scalability. That was the other issue with delegated RPKI. So, every validator that validates ROAs, as in Routinator or RIPE NCC RPKI Validator, or whatever, I can't remember exactly what it's called ‑‑ needs to fetch all the ROAs and related repository information from every repository, which means that if we have 1,000 repositories, then it becomes a bit weird and, like, it can be quite complicated to maybe troubleshoot them and such. And every validator needs to fetch ROAs from potentially thousands of repositories, so it's sort of a solution that's a bit odd and maybe I don't think everyone should do it, but, like, there are use cases for it.

So, Krill is magic. Krill is a piece of software made by NLnet Labs, the same people who have made Routinator, and it's a tool that really simplifies setting up the RPKI CA and repository. But it really ‑‑ it actually has a web UI, and it has improved a lot, and I think it's version 0.6 now, but it's really good and I didn't have many issues with it at all.
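To make the scaling concern concrete: if every validator has to contact every repository, the number of fetches per refresh cycle grows with the product of the two. The numbers below are purely hypothetical and only illustrate the shape of the growth.

```python
def fetches_per_cycle(validators, repositories):
    """Total repository fetches when every validator contacts every repository once."""
    return validators * repositories


# Hypothetical numbers, for illustration only.
VALIDATORS = 2000
few_repos = fetches_per_cycle(VALIDATORS, 5)       # e.g. only RIR-hosted repositories
many_repos = fetches_per_cycle(VALIDATORS, 1000)   # many self-hosted repositories
print(few_repos, many_repos, many_repos // few_repos)
```

Each additional self-hosted repository adds one more endpoint that every validator depends on, which is also why troubleshooting gets harder as the count grows.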

So, in Krill 0.5, they added a feature to the web UI to help with the weird ARIN format. However, this is my sort of slide that didn't quite make it into the PDF yet, but this is just a small list of the changes since these slides were made.

So, first of all, ARIN uses the proper format now, since about a month ago or maybe two.

They added ARIN format in web UI in 0.5 and removed it in 0.6. There has been a load of new delegated RPKIs.

Thank you. I can take questions now.

FRANZISKA LICHTBLAU: Thank you, Cynthia. I need to get an overview, because we have got a couple of comments. Asking as a private person: "Could you please summarise this talk in one sentence that a non‑technical person could understand what it is about?" I'm not sure how easy that is. You can give it a try.

CYNTHIA REVSTROM: Maybe a non extremely tech person and ‑‑

FRANZISKA LICHTBLAU: Or not too deep in that topic.

CYNTHIA REVSTROM: Okay. So, it's ‑‑ the difference between hosted RPKI and delegated RPKI especially, do you want to manage it internally in your own infrastructure, which may be a requirement, or do you want to have it on the RIR's infrastructure? And in most cases you should have it on the RIR's infrastructure, in my opinion, because it doesn't really make a lot of sense in a lot of cases to have it on your own infrastructure. Yeah, I think that's the best one I can do.

FRANZISKA LICHTBLAU: Okay. Thank you. I think most of the comments that we have got revolved around the fact that ARIN was not running RFC compliant but now, as you also stated, has changed that, and I think with that we answered most of the comments and questions we got, except for the request whether your shark was purchased at the Swedish furniture store?

CYNTHIA REVSTROM: It was indeed purchased at a certain Swedish furniture store, which is four letters.

FRANZISKA LICHTBLAU: We got just one quick question that we could still fit in. Ayane Satomi from Batangas State University: "How exactly is self‑hosted RPKI usually helps an on‑premise infrastructure in some cases?"

CYNTHIA REVSTROM: I'm not really sure how it would help. But, like, it's more if maybe ‑‑ yeah, I don't really know how to answer that question because it doesn't really help.

FRANZISKA LICHTBLAU: Maybe you can also take that off line.

CYNTHIA REVSTROM: Find me on ‑‑ yeah, RFCs, Twitter...

FRANZISKA LICHTBLAU: So, with that we close this session. Thank you very much, Cynthia. Thank you also to the commentators and the questioners. We will again have a 15‑minute break. The next session starts at 1600. And yes, remember to rate the talks, open the RIPE website. Go to the programme, log in with your RIPE NCC account, rate the talks and make the life of the PC much, much easier and also vote for our, I think by now, five or six PC candidates in order to influence who will make the selection for the next RIPE meeting.

So, thank you very much. Enjoy your break.

(Virtual applause)