When you want to collect news headlines, airline ticket prices, or other data from sites that don't offer APIs, you can use a scraper to grab elements off the page.
Script Kit includes a scrapeSelector() helper that takes the URL you want to scrape and a CSS selector for the elements you want from the page. Using the // Schedule metadata, you can also have the script run in the background on a cron schedule (the example below runs daily at 11:00 AM) and collect the data for you.
// Name: Scrape Tech News
// Schedule: 0 11 * * *

import "@johnlindquist/kit"

// Grab every h3 headline from the Google News tech page
let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3"
)

// Append today's headlines to ~/tech.md, creating the file if needed
let filePath = home("tech.md")
await ensureFile(filePath)

let contents =
  `
## ${new Date()}
` + h3s.map(h3 => `### ${h3}`).join("\n")

await appendFile(filePath, contents)
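Because the // Schedule metadata runs the script in the background, you may also want a quick way to review what it has collected. Here's a minimal sketch of a companion script, assuming Script Kit's built-in readFile, md, and div helpers behave as usual; it simply renders the accumulated markdown in the prompt:

// Name: Review Tech News

import "@johnlindquist/kit"

// Sketch: read the markdown file the scheduled scraper appends to
// and display it as HTML in the Script Kit prompt
let filePath = home("tech.md")
let contents = await readFile(filePath, "utf-8")
await div(md(contents))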
Strange, I couldn't get this to work with Google tech news: "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSFFpZ0FQAQ?hl=en-GB&gl=GB&ceid=GB%3Aen", but it worked fine with "reddit.com".
I imagine it's timing out. You can try increasing the timeout to 10 seconds (it defaults to 5):
let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3",
  el => el.innerText,
  {
    timeout: 10000,
  }
)
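The third argument shown above (el => el.innerText) is a transform applied to each matched element, so you can pull out more than plain text. As a rough sketch, and assuming each headline h3 contains an anchor (that's a guess about the page's markup, not something scrapeSelector guarantees), you could capture markdown-ready links instead:

let headlines = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3",
  // Assumption: each h3 wraps an anchor; fall back to plain text if it doesn't
  el => {
    let link = el.querySelector("a")
    return link ? `[${el.innerText}](${link.href})` : el.innerText
  },
  { timeout: 10000 }
)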