How to scrape the web for LLMs in 2024: Jina AI (Reader API), Mendable (Firecrawl) and Scrapegraph-AI

The Nugget

  • Web scraping in 2024: Tools such as Jina AI's Reader API, Mendable's Firecrawl, and Scrapegraph-AI turn web pages into clean, human-readable data for large language models (LLMs), which makes it practical to pull up-to-date information, such as market research taken directly from competitors' websites.

Make it stick

  • 🖥️ Jina AI's Reader API converts any URL to clean data by prepending "https://r.jina.ai/" to the URL.
  • 🛠️ Firecrawl by Mendable: Efficient web scraping tool that outputs structured Markdown from websites.
  • 🧩 Scrapegraph-AI: An open-source tool using multiple Python modules to create web scraping pipelines.
  • 💸 tiktoken: Library for measuring the token cost of scraped content using OpenAI's encoding schemes.

Key insights

Evolving Tools for Web Scraping

  • In 2024, several startups are building web scraping tools aimed at large language models.
  • These tools aim to deliver the up-to-date information that applications such as large language models (LLMs) and search engines depend on.

Mendable's Firecrawl

  • Early Concept: Mendable was initially seen in chatbots for documentation sites.
  • Current Usage: Firecrawl is designed specifically for web scraping.
  • Output: Provides clean, structured Markdown data.
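
As a rough illustration of requesting Markdown output, the sketch below builds a scrape request using only the Python standard library. The endpoint path, the `formats` field, and the `data.markdown` response shape are assumptions based on Firecrawl's hosted v1 REST API and should be checked against the current documentation; `build_scrape_request` and `scrape_markdown` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

# Assumption: hosted Firecrawl v1 scrape endpoint; verify against current docs.
FIRECRAWL_SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(target_url: str, api_key: str) -> Request:
    """Build (but do not send) a POST request asking for Markdown output."""
    payload = {"url": target_url, "formats": ["markdown"]}
    return Request(
        FIRECRAWL_SCRAPE_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_markdown(target_url: str, api_key: str, timeout: float = 60.0) -> str:
    """Send the request and return the page as Markdown (needs network and a key)."""
    with urlopen(build_scrape_request(target_url, api_key), timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the Markdown is returned under data.markdown.
    return body["data"]["markdown"]
```

Calling `scrape_markdown("https://example.com/pricing", api_key)` with a real key would return the page as a single Markdown string that can be passed to an LLM.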

Jina AI's Reader API

  • Simplicity: Jina AI is known for its embedding models; the Reader API is accessible without an API key.
  • Usage: Prepending "https://r.jina.ai/" to any URL returns clean data from that page for free.
  • Output: Produces human-readable, cleaned-up text, simplifying analysis.
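
The URL-prefix usage can be sketched in a few lines of Python. This is a minimal example assuming the Reader endpoint is `https://r.jina.ai/` and that no API key is needed for light use; `reader_url` and `fetch_clean_text` are hypothetical helper names, and rate limits may apply.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Assumption: the public Reader endpoint that is prefixed to the target URL.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Return the Reader URL for a full http(s) target URL."""
    parts = urlsplit(target_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected a full http(s) URL, got {target_url!r}")
    return READER_PREFIX + target_url


def fetch_clean_text(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the cleaned-up text of a page through the Reader (needs network)."""
    request = Request(reader_url(target_url), headers={"User-Agent": "reader-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `fetch_clean_text("https://example.com/pricing")` would request `https://r.jina.ai/https://example.com/pricing` and return the cleaned page text.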

Scrapegraph-AI

  • Open Source: An open-source library that orchestrates multiple Python modules.
  • Pipeline Creation: Lets you set up graph-based pipelines to scrape websites.
  • Flexibility: Can perform multiple steps before returning final output.
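
To make the idea of a multi-step pipeline concrete, here is a toy sketch in plain Python. It is not Scrapegraph-AI's API: the node functions, the shared state dictionary, and the canned HTML are all hypothetical and only illustrate several steps running in sequence before a final output is returned.

```python
import re
from typing import Callable

State = dict[str, object]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    # Stand-in for a real fetch step: returns canned HTML instead of calling the network.
    return {**state, "html": "<h2>Pro</h2><p>$20 per month</p>"}


def parse_node(state: State) -> State:
    # Replace tags with spaces, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", str(state["html"]))
    return {**state, "text": " ".join(text.split())}


def extract_node(state: State) -> State:
    # Pull dollar amounts out of the text; a real pipeline might ask an LLM here.
    prices = re.findall(r"\$\d+(?:\.\d{2})?", str(state["text"]))
    return {**state, "prices": prices}


def run_pipeline(nodes: list[Node], state: State) -> State:
    """Run each node in order, passing the accumulated state along."""
    for node in nodes:
        state = node(state)
    return state


result = run_pipeline(
    [fetch_node, parse_node, extract_node],
    {"url": "https://example.com/pricing"},
)
print(result["prices"])
```

Each node reads what earlier nodes produced and adds its own key, so steps can be reordered or swapped without changing the runner.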

Practical Implementation

Market Research Example

  • The narrator uses these tools to scrape competitors’ pricing pages.
  • The same scraping task was run with Jina AI, Mendable's Firecrawl, and Beautiful Soup.
  • Comparison: Scraping with Beautiful Soup is more often blocked and costs more, while Jina AI and Firecrawl provide cleaner, more cost-effective outputs.
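
One plausible reason for the cost gap, assuming the raw page HTML is what gets passed to the model, is that markup, scripts, and styles travel along with the visible text. The sketch below uses only the standard library's `html.parser` on a made-up sample page to compare the two sizes; `VisibleTextExtractor` and `SAMPLE_HTML` are illustrative names, not part of any of the tools discussed.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page, for illustration only.
SAMPLE_HTML = """<html><head><style>.tier{color:red}</style>
<script>trackVisit("pricing");</script></head>
<body><div class="tier"><h2>Pro</h2><p>$20 per month</p></div></body></html>"""

cleaned = visible_text(SAMPLE_HTML)
print(len(SAMPLE_HTML), "characters of raw HTML")
print(len(cleaned), "characters of visible text")
```

On real pages the gap is usually much larger than in this tiny sample, since navigation, tracking scripts, and inline styles tend to dominate the markup.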

Token Costs using tiktoken

  • tiktoken counts the tokens in the scraped content, which gives an estimate of what it costs to send that content to a large language model.
  • Practical Insight: Knowing and managing token costs helps optimize expenses when using LLMs for web scraping.
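
A minimal sketch of this kind of estimate, assuming the third-party `tiktoken` package is installed (`pip install tiktoken`) and recent enough to know the `gpt-4o` model name. The per-million-token price is a parameter, because actual prices vary by model and change over time; the token counts in the example are placeholders, not measurements from the video.

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens the way the given OpenAI model would encode the text."""
    import tiktoken  # imported lazily so the cost helper works without it

    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def estimate_input_cost(token_count: int, usd_per_million_tokens: float) -> float:
    """Estimate the input cost in USD for a given token count and price."""
    return token_count / 1_000_000 * usd_per_million_tokens


# Placeholder numbers: two hypothetical scrapes of the same page at a hypothetical price.
HYPOTHETICAL_PRICE = 5.0  # USD per million input tokens
for label, tokens in [("raw HTML", 50_000), ("cleaned text", 5_000)]:
    print(f"{label}: {tokens} tokens, about ${estimate_input_cost(tokens, HYPOTHETICAL_PRICE):.4f}")
```

In practice, `count_tokens` would be applied to each tool's output for the same page, and the resulting counts fed into `estimate_input_cost` to compare the tools.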

Extraction of Insights

  • The experiment aimed to extract pricing tiers from competitors’ websites.
  • Results: The tools, including Firecrawl and Jina AI, differed in how readable their output was and in token cost efficiency.

Key quotes

  • "It's great, guys. They have bags of VC money they can blow - just like free rides with Uber in the early days."
  • "Web scraping manually using Beautiful Soup is about ten times more expensive than using third-party tools like Jina AI or Firecrawl."
  • "For a large language model when it’s being encoded, maybe you can explain this better than me because I’m a software engineer..."
  • "GPT-4 and GPT-4o managed to reduce costs due to advancements in tokenization techniques."
  • "Seems like most of these tools pass the test; it's just a matter of whether or not you want to burn 10 times the amount of money."