763

May 1st, 2024 × #web-scraping #data-collection #apis

Web Scraping + Reverse Engineering APIs

Covers techniques for web scraping, dealing with private APIs, handling authentication, parsing HTML, and challenges like captchas.

Topic 0 00:00

Transcript

Scott Tolinski

Wild weekend. We had our school auction because our kids go to public schools here, so they, you know, have fundraisers and stuff. And my wife, Courtney, decided to be the co-chair of the auction, which we've heard is kind of a hell job. Everybody has often talked about how that's not something that you want to do, but she decided to step up and co-chair it. And my gosh, no, they crushed it. They raised the most money they've ever raised here. Everything went off without a hitch. The event was great. All the parents were happy.

Scott Tolinski

It was just a wild weekend, but, man, we've been a little stressed over here at the household leading up to this event. So it was really cathartic to get that out. And, you know, it was such a release that, on Sunday, she was just totally wiped. So what an eventful weekend. But, man, it was really cool. I got to do some dancing at the event too. Did you auction yourself off? Was it, like, one of those? Like, dinner with Mister Tolinski? I wish. No. It was, like, you know, all the parents donated a bottle of wine so then you could, you know... Oh, yeah. ...try to win that too. 50 bottles of wine or whatever, or different events. People were putting up their vacation homes. The cool thing about Colorado is a lot of parents have vacation homes in the mountains.

Topic 1 02:33

Donating auction items as charitable contributions

Scott Tolinski

Can I do that? I don't know. We can if it's a charitable donation. It depends. There's, like, a value minus the thing. It's a charitable donation, technically. Think about that. Yeah. Since it's donated.

Scott Tolinski

Yeah. Yeah.

Scott Tolinski

I know there was a my dad used to do a guitar auction, and he would get a bunch of guitars from famous musicians.

Scott Tolinski

And the rule of that is you could write off anything that was the value of the guitar.

Scott Tolinski

You know, whatever you spent minus the value of the guitar, you could write off. So, you know, they were cheap guitars, but they would be signed by, you know, famous people. That's what made them valuable. So it was nice to be able to write that kind of stuff off as charitable contributions.

Scott Tolinski

Let's say here's stuff you also wanna write off, Wes. You sometimes wanna write off bugs that are coming into your Sentry because somebody's scraping your website and causing, maybe, potentially unforeseen issues. Man, we see some wild stuff in our Sentry sometimes, whether it's bots hitting us trying to find our WP admin or whatever. But it's nice to have that visibility in terms of, like, what's going on in your application. And even if things aren't being used like they're supposed to be used, you don't want your app to crash because Wes Bos is trying to scrape it and found something going on here that triggers some sort of... The amount of, like, code paths that we've hit

Topic 2 03:14

Bots hitting websites causing issues

Scott Tolinski

Absolutely. Well, if you want to have that kind of visibility into your application, head on over to sentry.io/syntax.

Scott Tolinski

Sign up and get 2 months for free. It's an awesome tool, and I've got mine open all the time. So let's get into scraping. First and foremost, what is it? Why might you want it? Well, sometimes you wanna get some data from websites, whether that is behind the login portal or in front of the login portal. Maybe there is a host of websites, several websites, and you want all that data.

Scott Tolinski

Not every website has an API. In fact, most websites don't have an API.

Scott Tolinski

Not only that, if they have an API, maybe that API costs money to use. And we saw this with Twitter. They pulled the rug on everybody who was using the API, and all of a sudden the API became prohibitively expensive. Same thing with Reddit. Right? Now, all of a sudden, you can't use their API even though the API exists to grab the data.

Scott Tolinski

So even if they have an API, it's not assured that you'll be able to grab data from various websites.

Scott Tolinski

So you might be thinking, alright, if their app doesn't have an API, am I out of luck? No. That's where scraping comes in because, guess what? We have computers, and computers can read websites too. And what's on websites is more than just divs and all that stuff. It's text. It's information.

Scott Tolinski

And you can navigate through actual websites with code, grab data, store it, do something with it, analyze it, use it for various purposes.
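As a rough illustration of "navigate a page with code and grab data", here is a minimal sketch using only Python's standard library. The markup, the class names (`product`, `name`, `price`), and the product values are all invented for the example; a real page would need its own selectors, and fetching the HTML over the network is left out.

```python
from html.parser import HTMLParser

# Hypothetical product listing; a real scraper would download this HTML first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">AA Batteries</span><span class="price">$12.99</span></li>
  <li class="product"><span class="name">AAA Batteries</span><span class="price">$9.49</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of the name and price spans inside each list item."""

    def __init__(self):
        super().__init__()
        self.products = []    # one dict per finished <li>
        self._current = {}    # fields gathered for the <li> being read
        self._field = None    # "name" or "price" while inside that span

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.products.append(self._current)
            self._current = {}


def scrape_products(html):
    """Return (name, price) tuples, with the price converted to a float."""
    parser = ProductParser()
    parser.feed(html)
    return [(p["name"], float(p["price"].lstrip("$"))) for p in parser.products]
```

Once the data is in plain tuples like this, storing or analyzing it is ordinary code; the scraping part is only the step that turns markup into values.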

Scott Tolinski

I think about, like, all kinds of data aggregation systems. There's things like Social Blade that are hitting all the different social accounts and logging their follower numbers and putting them in big tables so that you can follow. They're not doing that through APIs. That would be just impossible, especially with things like Instagram, which doesn't even have a public API like that. There's podcast aggregators. We get all kinds of emails from people trying to get on the show because they found us on some podcast aggregation website that says, here's the podcast you gotta get on. And those are scraping. They're scraping information from the various podcast hosts or the various players, grabbing public charting data, and putting it into a big table. So, basically, scraping is here to give you access to all of the data on the World Wide Web, for the most part, even if a site doesn't have an interface to do that.
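The "log follower numbers into a big table" idea can be sketched roughly like this. It is not how Social Blade or any aggregator actually works; the table name, the account name, and the numbers are made up, and the scraping step that produces each count is assumed to happen elsewhere.

```python
import sqlite3
from datetime import date


def open_store(path=":memory:"):
    """Open (or create) a small SQLite table of daily follower snapshots."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS follower_counts ("
        "account TEXT NOT NULL, day TEXT NOT NULL, followers INTEGER NOT NULL, "
        "PRIMARY KEY (account, day))"
    )
    return conn


def record_snapshot(conn, account, day, followers):
    """Store one scraped count; re-running on the same day overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO follower_counts VALUES (?, ?, ?)",
        (account, day.isoformat(), followers),
    )
    conn.commit()


def daily_growth(conn, account):
    """Return (day, change since the previous snapshot) pairs, oldest first."""
    rows = conn.execute(
        "SELECT day, followers FROM follower_counts WHERE account = ? ORDER BY day",
        (account,),
    ).fetchall()
    return [(later[0], later[1] - earlier[1]) for earlier, later in zip(rows, rows[1:])]
```

A scheduled job would call `record_snapshot` once per day per account with whatever number the scraper pulled from the public profile page, and the table accumulates the history over time.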

Topic 3 08:44

Using web scrapers to get deals on online marketplaces

Scott Tolinski

like, it will look for files and move them into places automatically for you. So I have a Hazel script that looks for any file with the word screenshot in the title. So anytime I take a screenshot, it puts it into a directory. I just checked my screenshot directory, Wes. It has 1,000 items. It is 100 megabytes.

Scott Tolinski

And why don't I just delete this? I should just be dumping this thing periodically.

Scott Tolinski

Why do I need 1,000 screenshots? I'm never gonna go back and look at these, apparently. Oh, man. So, you can change where the screenshots go in... Oh, yeah. ...in macOS. How come you have to have a script for that? Because it's a part of my overall Hazel flow. What I don't want is for the configuration for all these things to be in the various apps. When I take a screenshot, it just goes to the default place. Right? Desktop.

Scott Tolinski

Desktop.

Scott Tolinski

And, likewise, anything that ever lands on my desktop gets run through my Hazel script, and it gets put... I see. You want it to trigger on anything? Correct. It gets put into the inbox, and the inbox is on my Synology.

Scott Tolinski

That inbox then has a whole host of its own scripts where it then, you know, takes things where they need to go. Yeah. Right. Central station routing. Yeah. So I only have to go to one place to ever figure out my automated file routing. Yeah. And you can put anything in there. Like, I have one of those for

Topic 4 13:22

Scraping social media stats over time

Scott Tolinski

I got some good deals on rechargeable batteries at Canadian Tire. I built a web scraper to do that.

Scott Tolinski

Man, you know, it's funny. I'm really stoked to be able to talk to you about this because you're a scraper OG. You've been scraping, you know, since I was just first writing some JavaScript myself. So you've been at the scraping game a long time, and the only scraper I've really written is one to grab exercise information from my exercise machine.

Scott Tolinski

They have a host of exercises that are built into this thing, but they don't make it public. So I, you know, grabbed the APIs and have been scraping images, scraping all the exercise names and all the information so that I can put together a custom little exercise builder online for it. It's just slow going, and I have a lot of side projects. But that's really the only, like, actual scraper I've spent any time on. They're super fun, and I honestly think you learn a lot about how web tech works. You learn a lot about how

Scott Tolinski

So

Scott Tolinski

the hidden web, which was, like, private APIs and things like that, where we talked about some of this stuff. And go back and listen to that episode if you're interested more in Proxyman because we talk about connecting even to your iPhone to scrape native apps, not just web apps. Right? So you can access a lot of stuff with things like Proxyman beyond just clicking around a website if you have a native app. Let's say that's what I was working in. My Tonal thing... Yeah. Only a native app. So there's definitely a lot of cool stuff you can do with that app.

Scott Tolinski

Adobe is checking the date of the Oh my god. Yes.

Scott Tolinski

That's a throwback for anybody who has ever installed Adobe Photoshop.

Scott Tolinski

Yeah. It is funny because, since these aren't, like, versioned APIs, you're beholden to... Yeah. ...the way the application works. And if they change it ever so slightly and the thing you're keying off of or looking for doesn't exist anymore, it's moved, the document structure has changed, you might just have yourself a broken thing. So you might always find yourself chasing a train here.
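One way to cope with an unversioned private API is to check the shape of the response and fail with a clear error when it moves, instead of quietly producing bad data. The sketch below assumes a made-up JSON payload with an `exercises` list whose items have a `name` field; those field names and the `ShapeChangedError` class are hypothetical, not any real service's API.

```python
import json


class ShapeChangedError(Exception):
    """Raised when a response no longer looks the way the scraper expects."""


def extract_exercise_names(payload):
    """Pull exercise names out of a JSON string, failing loudly on a new shape."""
    data = json.loads(payload)
    try:
        # Assumed shape: {"exercises": [{"name": "..."}, ...]}
        return [item["name"] for item in data["exercises"]]
    except (KeyError, TypeError) as exc:
        raise ShapeChangedError(
            f"response no longer matches the expected shape: {exc!r}"
        ) from exc


# Invented examples of the same data before and after an upstream change.
OLD_RESPONSE = '{"exercises": [{"name": "Bench Press"}, {"name": "Deadlift"}]}'
NEW_RESPONSE = '{"data": {"movements": [{"title": "Bench Press"}]}}'
```

With a dedicated error type, a scheduled scraper can alert on `ShapeChangedError` specifically, which is usually the signal that the site changed rather than that the network failed.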

Topic 5 25:05

Methods for scraping data from websites

Scott Tolinski

Yeah. And the reason why that is, in case anybody isn't familiar, is because, straight up, when you have client-side rendered JavaScript, again, that is rendered on the client. And if you're hitting it from the server, you're gonna get HTML with the JavaScript file. So what these things are doing is they are loading up the application as if it was a real website in a real browser. It's just not showing you or rendering it visually.
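To make the "HTML with the JavaScript file" point concrete, here is a small sketch that checks whether a raw HTML response is just an empty shell plus script tags. The two sample documents are invented, and the heuristic (no visible text, at least one external script) is an assumption for illustration, not a reliable detector.

```python
from html.parser import HTMLParser


class ShellInspector(HTMLParser):
    """Gathers visible text and external script URLs from raw HTML."""

    # Text inside these tags is never shown in the page body.
    HIDDEN = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.script_sources = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth and data.strip():
            self.text_chunks.append(data.strip())


def looks_client_rendered(html):
    """True when the markup has scripts to run but no visible text of its own."""
    inspector = ShellInspector()
    inspector.feed(html)
    return not inspector.text_chunks and bool(inspector.script_sources)


# What a plain HTTP request might get back from a client-side rendered app.
CSR_SHELL = (
    "<html><head><title>App</title></head>"
    '<body><div id="root"></div><script src="/bundle.js"></script></body></html>'
)
# What a server-rendered page might look like instead.
SERVER_RENDERED = "<html><body><h1>Batteries</h1><p>$12.99</p></body></html>"
```

When a check like this comes back true, a plain HTTP fetch has nothing to parse, which is the case where loading the page in a headless browser becomes necessary.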

Scott Tolinski

And this is the same way that, you said, Cypress or these types of things work. The way Cypress or Playwright do testing is by opening up your site in a real browser and then... I almost knocked my coffee over. ...and then inspecting the DOM (gesticulating too much), inspecting the DOM and making sure things interact the way that they should, as if a real user was using it, which is one of the reasons why end-to-end testing on the web is such a great way

Scott Tolinski

JSDOM? Is it just because it's more modern? Is that it? It just implements everything and works with everything? Is there any... Yeah. ...special sauce there, or is that it? I don't know why. All I know is that I've always just had

Scott Tolinski

That's nice. And typically, you know, again, you kinda have to get a lay of the land of what their HTML is looking like when you're working with this DOM stuff because, you know, some people got some really crazy HTML. And I think that is, you know, one thing that is good about being able to parse this kind of stuff or really dive into it. You get to see all the kinds of wild things that exist in actual production websites, whether that's infinitely nested divs or spans or whatever. You might find yourself really having to get deep into some nested structures here.

Topic 6 31:37

Using ARIA labels and test IDs to select elements

Scott Tolinski

That's very fun. Yeah. That would that would be a pain, right, to rewrite all your Wes to hunt for something that's not a test ID, especially with dynamically generated classes.
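The reason test IDs and ARIA labels are nicer to select on than generated class names can be shown with a small sketch. The markup below is invented: two builds of the same hypothetical page where the hashed class names differ but `data-testid` and `aria-label` stay put. The helper uses only the standard library and is an illustration, not a full selector engine.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AttributeTextFinder(HTMLParser):
    """Collects the text inside every element whose attribute equals a value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.results = []
        self._depth = 0      # nesting depth inside the current match
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get(self.attr) == self.value:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self.results.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._chunks.append(data.strip())


def find_text(html, attr, value):
    finder = AttributeTextFinder(attr, value)
    finder.feed(html)
    return finder.results


# Same hypothetical component from two deploys; only the generated classes differ.
BUILD_A = ('<div class="css-1x2y3z"><span data-testid="price" class="css-9q8w7e">$12.99</span>'
           '<button aria-label="Add to cart" class="css-a1b2c3">+</button></div>')
BUILD_B = ('<div class="css-7h7h7h"><span data-testid="price" class="css-k4k4k4">$12.99</span>'
           '<button aria-label="Add to cart" class="css-z0z0z0">+</button></div>')
```

Selecting on `class="css-9q8w7e"` would work for the first build and silently find nothing in the second, while `find_text(html, "data-testid", "price")` keeps working across both.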

Scott Tolinski

AI can really come in handy. If you give it a lot of data, it's always good at going through that data and suggesting maps or loops or whatever to parse that data. I've had nothing but positive experiences there, especially if you know what you're doing, because then it outputs the code, you read the code... Yeah. Mhmm. It's good. Works.

Scott Tolinski

Yeah. I agree. And I actually had that same situation with several Bun situations where there's the file router API or stuff like that. I'm, like, writing Bun software. I'm like, should I be doing this, or should I just be trying to write it so it works with any runtime? You know, it's such a tough

Topic 7 43:21

Dealing with captchas when scraping

Scott Tolinski

bad credit cards and whatnot into my checkout forms. Yeah. Did you see that story about the Amazon walk-out stores? Oh my gosh. That it was just a Mechanical Turk situation? Yeah. It escalated a little. Yeah. From my understanding, there were the Amazon, like, physical stores that you could just walk in, grab your stuff, and walk out. Grocery store. Yeah. It would use your phone to essentially charge you for everything that you took. But in reality, there were, like, people in, I believe, India that were... Yeah. ...processing the transactions manually themselves.

Scott Tolinski

Like... So they're just watching you pick up things and put them back. Yeah. When it was initially announced, it was kind of lauded as... or lauded, I don't know if that's right. Where it was initially, essentially, it's like, hey.

Scott Tolinski

This is some cool, crazy new tech, and it's just a Mechanical Turk, which, historically, there was like a machine that was a computer, and what was inside of it was just people.

Scott Tolinski

So that's where that term comes from. It's a very wild concept.

Scott Tolinski

Yeah. Shane Yarn. Node. Sick picks. Sick picks. I'm going to sick pick a small little app for macOS, something that I use all the time.

Scott Tolinski

And you might have another tool for this. I don't know if BetterTouchTool does this, Wes, or any of these apps that you probably have. But KeyboardCleanTool.

Scott Tolinski

I use KeyboardCleanTool once a week to clean my screen. I bought a bottle of this Zeiss lens cleaner, which is, like, alcohol free, so it's not gonna ruin your screen. Because if you use alcohol on the screen, it will remove the shine off of it. So KeyboardCleanTool, it just essentially locks your keyboard, locks everything, without you clicking and holding on a button for a specific time.

Scott Tolinski

And that makes it just really nice and easy to spray some on a thing and wipe down your keyboard, wipe down your screen. Do it once a week, and your computer is not going to be awful. And it's a free little app, so might as well use it. Right? So I'll post a link to that in the show notes. This is by the same developer as BetterTouchTool. It is. Okay. Yeah. That's awesome.

Scott Tolinski

I do that too. Yeah. I still have to flip to the other one and decrement each one. I do hyper and then the arrow keys, and it'll move it over more and more to the Node. Or hyper up, and it moves it more and more in the center.

Scott Tolinski

workflows because Let me tell you, Ubay is open source, and it looks really cool. So

Scott Tolinski

Shameless plugs. Check out Syntax on YouTube. We're on YouTube. We release all kinds of stuff, and CJ, Coding Garden, has been doing a lot of content that's based on shows that we've done, where he'll do an hour sometimes on deep dives on topics that we do. So if you wanna get deeper into some topics, he's been doing a whole series on self-hosting right now, diving deep into managing your own VPS. He's gonna be talking about Coolify very soon. So a lot of cool stuff going on over there. Alright. That's it. Thanks, Eric, for tuning in. We'll catch you later. Peace.