sitefetch

Fetch an entire site and save it as a text file (to be used with AI models)

21 Apr 2025 running:

sitefetch is a command-line tool that fetches an entire website and saves it as a single text file for use with AI models. It can be installed globally or run one-off, and it has options for customizing the fetch, such as specifying which pages to match or a content selector. The tool is open source under the MIT license.

It could be useful to you if you need to extract text data from websites for use in AI models or other applications.

npx sitefetch https://knowledge.kaltura.com/help -o ~/test/site-fetch-knowledge-kaltura.txt --concurrency 10

failed with a JavaScript heap out-of-memory error after about 519 s, with the heap near 4 GB:

INFO Fetching https://knowledge.kaltura.com/help/how-to-edit-a-slide
INFO Fetching https://knowledge.kaltura.com/help/exporting-your-powtoon
INFO Fetching https://knowledge.kaltura.com/help/inplayer-integration-with-kaltura-video-portal
INFO Fetching https://knowledge.kaltura.com/help/paywall-administrator-guide
INFO Fetching https://knowledge.kaltura.com/help/inplayer-user-guide

<--- Last few GCs --->

[85657:0x120078000]   519322 ms: Mark-sweep 3992.3 (4132.1) -> 3992.2 (4115.9) MB, 1055.0 / 0.0 ms  (average mu = 0.163, current mu = 0.013) allocation failure; scavenge might not succeed
[85657:0x120078000]   519358 ms: Scavenge 4007.9 (4115.9) -> 4009.6 (4131.9) MB, 23.0 / 0.0 ms  (average mu = 0.163, current mu = 0.013) allocation failure; 


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x10288a200 node::Abort() [/usr/local/bin/node]
 2: 0x10288a3f0 node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/usr/local/bin/node]
 3: 0x1029d0880 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 4: 0x102b7b5e8 v8::internal::EmbedderStackStateScope::EmbedderStackStateScope(v8::internal::Heap*, v8::internal::EmbedderStackStateScope::Origin, cppgc::EmbedderStackState) [/usr/local/bin/node]
 5: 0x102b7f1f0 v8::internal::Heap::CollectSharedGarbage(v8::internal::GarbageCollectionReason) [/usr/local/bin/node]
 6: 0x102b7c1e4 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, char const*, v8::GCCallbackFlags) [/usr/local/bin/node]
 7: 0x102b7963c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
 8: 0x102b6e374 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
 9: 0x102b6eba4 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
10: 0x102b549d8 v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/local/bin/node]
11: 0x102ee49e4 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
12: 0x10323104c Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/usr/local/bin/node]
13: 0x1079c9f74 
14: 0x107dce3a0 
15: 0x107de70f0 
16: 0x107d94a38 
17: 0x107d94c48 
18: 0x107d94da4 
19: 0x107e2cd64 
20: 0x107fd3f3c 
21: 0x10819de28 
22: 0x107d951d4 
23: 0x107a3504c 
24: 0x1031edef4 Builtins_AsyncFunctionAwaitResolveClosure [/usr/local/bin/node]
25: 0x10327c6f8 Builtins_PromiseFulfillReactionJob [/usr/local/bin/node]
26: 0x1031dfc4c Builtins_RunMicrotasks [/usr/local/bin/node]
27: 0x1031ba3a4 Builtins_JSRunMicrotasksEntry [/usr/local/bin/node]
28: 0x102afce10 v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [/usr/local/bin/node]
29: 0x102afd300 v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [/usr/local/bin/node]
30: 0x102afd4dc v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*, v8::internal::MaybeHandle<v8::internal::Object>*) [/usr/local/bin/node]
31: 0x102b23bac v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) [/usr/local/bin/node]
32: 0x102b24444 v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*) [/usr/local/bin/node]
33: 0x102a411a0 v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) [/usr/local/bin/node]
34: 0x102a40c9c v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) [/usr/local/bin/node]
35: 0x102a404c8 v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
36: 0x10323118c Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit [/usr/local/bin/node]
37: 0x10795b3a8 
38: 0x1031ba4d0 Builtins_JSEntryTrampoline [/usr/local/bin/node]
39: 0x1031ba164 Builtins_JSEntry [/usr/local/bin/node]
40: 0x102afce40 v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [/usr/local/bin/node]
41: 0x102afc374 v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) [/usr/local/bin/node]
42: 0x1029ec8a8 v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) [/usr/local/bin/node]
43: 0x1027d4d00 node::InternalCallbackScope::Close() [/usr/local/bin/node]
44: 0x1027d4fc8 node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) [/usr/local/bin/node]
45: 0x1027e98ec node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) [/usr/local/bin/node]
46: 0x102928edc node::(anonymous namespace)::CompressionStream<node::(anonymous namespace)::ZlibContext>::AfterThreadPoolWork(int) [/usr/local/bin/node]
47: 0x103198cbc uv__work_done [/usr/local/bin/node]
48: 0x10319c458 uv__async_io [/usr/local/bin/node]
49: 0x1031ae1a4 uv__io_poll [/usr/local/bin/node]
50: 0x10319c8e8 uv_run [/usr/local/bin/node]
51: 0x1027d56d4 node::SpinEventLoop(node::Environment*) [/usr/local/bin/node]
52: 0x1028c5128 node::NodeMainInstance::Run() [/usr/local/bin/node]
53: 0x102858ec8 node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) [/usr/local/bin/node]
54: 0x102859190 node::Start(int, char**) [/usr/local/bin/node]
55: 0x185b64274 start [/usr/lib/dyld]
[1]    85655 abort      npx sitefetch https://knowledge.kaltura.com/help -o  --concurrency 10

need to try with:

NODE_OPTIONS="--max-old-space-size=8192" npx sitefetch https://knowledge.kaltura.com/help -o output.txt --concurrency 10

or lower concurrency:

npx sitefetch https://knowledge.kaltura.com/help -o output.txt --concurrency 5
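Before starting another long crawl, it may be worth confirming that the larger heap limit is actually picked up. A minimal sketch, assuming node is on the PATH; heap_limit_mb is a made-up helper name, not part of sitefetch. v8.getHeapStatistics() is a standard Node API, and as far as I understand the reported limit usually comes out a bit above the --max-old-space-size value because it also counts young-generation space:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the V8 heap limit (in MB) that node reports
# when started with the given NODE_OPTIONS string.
heap_limit_mb() {
  NODE_OPTIONS="$1" node -e \
    'console.log(Math.round(require("v8").getHeapStatistics().heap_size_limit / 1024 / 1024))'
}

# Usage (not run here):
#   heap_limit_mb "--max-old-space-size=8192"
#   heap_limit_mb ""    # whatever the default limit is on this machine
```

If the first call prints roughly the same number as the second, NODE_OPTIONS is not reaching the node process and raising the limit will not help.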

250421-2310 trying with a 40 GB heap limit and concurrency 3:

NODE_OPTIONS="--max-old-space-size=40960" npx sitefetch https://knowledge.kaltura.com/help -o output.txt --concurrency 3

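didn't work: no out-of-memory crash this time, but the run died with "TypeError: fetch failed", caused by a connect timeout (10000 ms) to knowledge.kaltura.com:443: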

INFO Fetching https://knowledge.kaltura.com/help/pdfexport/id/64380d08131b0778692a9423
WARN Not a HTML page: https://knowledge.kaltura.com/help/pdfexport/id/6388eeb98c2c91226d2a734b
INFO Fetching https://knowledge.kaltura.com/help/pdfexport/id/66b4d4befb7013405c05223c
WARN Not a HTML page: https://knowledge.kaltura.com/help/pdfexport/id/64380d08131b0778692a9423
INFO Fetching https://knowledge.kaltura.com/help/pdfexport/id/64511b4b2371ae6505100088
INFO Fetching https://knowledge.kaltura.com/help/pdfexport/id/64511b630f73573ebc15e80b
WARN Not a HTML page: https://knowledge.kaltura.com/help/pdfexport/id/66b4d4befb7013405c05223c
INFO Fetching https://knowledge.kaltura.com/help/pdfexport/id/64cba0c8580e634ef24cce67
INFO Fetching https://knowledge.kaltura.com/help/integrate-kaltura-meeting-api-and-lti-documentation
node:internal/deps/undici/undici:11118
    Error.captureStackTrace(err, this);
          ^

TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11118:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async #fetchPage (file:///Users/nic/.npm/_npx/d3d6b3eceb5acd4a/node_modules/sitefetch/dist/src-vKvD6otA.js:2159:15)
    at async file:///Users/nic/.npm/_npx/d3d6b3eceb5acd4a/node_modules/sitefetch/dist/src-vKvD6otA.js:553:21 {
  cause: ConnectTimeoutError: Connect Timeout Error (attempted address: knowledge.kaltura.com:443, timeout: 10000ms)
      at onConnectTimeout (/Users/nic/.npm/_npx/d3d6b3eceb5acd4a/node_modules/undici/lib/core/connect.js:237:24)
      at Immediate._onImmediate (/Users/nic/.npm/_npx/d3d6b3eceb5acd4a/node_modules/undici/lib/core/connect.js:206:11)
      at process.processImmediate (node:internal/timers:471:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}

Node.js v18.12.0
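Since a single connect timeout kills the whole run, one thing to try next is wrapping the command in a retry loop with a growing pause between attempts. This is only a sketch: retry_with_backoff is a hypothetical helper, not a sitefetch feature, and I have not checked whether sitefetch can resume a partial crawl, so each retry presumably starts from scratch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to MAX_TRIES times,
# doubling the wait (in seconds) after each failed attempt.
# Usage: retry_with_backoff MAX_TRIES INITIAL_DELAY command [args...]
retry_with_backoff() {
  local max_tries="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (not run here):
#   export NODE_OPTIONS="--max-old-space-size=8192"
#   retry_with_backoff 3 30 npx sitefetch https://knowledge.kaltura.com/help -o output.txt --concurrency 3
```

The pause gives the server (or any rate limiting in front of it) time to recover; lowering --concurrency further may also reduce the chance of hitting the timeout in the first place.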
