In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within the realm of MindsDB, a web crawler can be employed to harvest data, which can be used to train models, domain specific chatbots or fine-tune LLMs.

Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

CREATE DATABASE my_web 
WITH ENGINE = 'web';

If you installed MindsDB locally via pip, you need to install all handler dependencies manually. To do so, go to the handler’s folder (mindsdb/integrations/handlers/web_handler) and run this command: pip install -r requirements.txt.

Usage

Get Websites Content

Here is how to get the content of docs.mindsdb.com:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 1;

You can also get the content of internal pages. Here is how to fetch the content from 10 internal pages:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 10;

Another option is to get the content from multiple websites.

SELECT * 
FROM my_web.crawler 
WHERE url IN ('docs.mindsdb.com', 'docs.python.org') 
LIMIT 1;

Get PDF Content

MindsDB accepts file uploads of csv, xlsx, xls, sheet, json, and parquet. However, you can utilize the web crawler to fetch data from pdf files.

SELECT * 
FROM my_web.crawler 
WHERE url = '<link-to-pdf-file>' 
LIMIT 1;

For example, you can provide a link to a pdf file stored in Amazon S3.