Description
It looks like Node ReGrid can get about 3x more writes than .NET, yielding faster upload wall time; see the benchmark image below (credit: @buskila).
Test setup
Upload only:
- File size: 1 GB
- Server: RethinkDB / Linux / Ubuntu 14, 3 nodes
- Client: .NET Core / Linux
- Chunk size: default
- Batch size: default 8 -> 32
They tried both a single connection and connection pooling; it made no difference.
Using Stream IO:
```csharp
// Upload a file using an IO stream
Guid uploadId;
using( var fileStream = File.Open("C:\\video.mp4", FileMode.Open) )
using( var uploadStream = bucket.OpenUploadStream("/video.mp4") )
{
    uploadId = uploadStream.FileInfo.Id;
    fileStream.CopyTo(uploadStream);
}
```
Suspicion
There may be too much chunk calculation in the stream upload code. Try to parallelize and/or simplify some of it, especially when the caller already hands us a byte[].
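If the input is already a byte[], the chunk boundaries can be computed once up front rather than per write. A minimal sketch, assuming nothing about the driver's internals (the Chunker name and the 255 KiB chunk size are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

static class Chunker
{
    // Pre-compute chunk views over the buffer so the upload loop does
    // no per-chunk arithmetic and no copying.
    public static IEnumerable<ArraySegment<byte>> Slice(byte[] data, int chunkSize = 255 * 1024)
    {
        for( int offset = 0; offset < data.Length; offset += chunkSize )
        {
            int size = Math.Min(chunkSize, data.Length - offset);
            yield return new ArraySegment<byte>(data, offset, size);
        }
    }
}
```

Each segment could then be handed straight to a pending chunk insert without re-slicing the source buffer.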
Node's ReGrid upload code is here:
https://github.com/internalfx/regrid/blob/master/lib/upload.js
Other notes
This should come after #77 is done.
After some discussion with @internalfx (thanks a bunch), it turns out the upload code uses node streams. Node streams info via @buskila:
Using .pipe() has other benefits too, like handling backpressure automatically so that node won't buffer chunks into memory needlessly when the remote client is on a really slow or high-latency connection.
https://github.com/substack/stream-handbook
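For what it's worth, .NET can get the same bounded-buffering guarantee without node streams. A minimal sketch using System.Threading.Channels (available on modern .NET; none of these names come from the ReGrid driver):

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

static class BackpressureSketch
{
    // A bounded channel behaves like .pipe(): once `capacity` chunks are
    // queued, WriteAsync suspends the read loop instead of buffering
    // more of the file into memory.
    public static Channel<byte[]> CreateBounded(int capacity = 10) =>
        Channel.CreateBounded<byte[]>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait
        });

    public static async Task PumpAsync(Stream source, ChannelWriter<byte[]> writer, int chunkSize = 255 * 1024)
    {
        var buffer = new byte[chunkSize];
        int read;
        while( (read = await source.ReadAsync(buffer, 0, chunkSize)) > 0 )
        {
            var chunk = new byte[read];
            Array.Copy(buffer, chunk, read);
            await writer.WriteAsync(chunk); // suspends here when the channel is full
        }
        writer.Complete();
    }
}
```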
Currently, @internalfx's implementation keeps 10 network requests in flight at any given time. No matter how high the network latency gets, node won't push another write to the ReGrid API until at least one in-flight request completes.
Cool. I think we could do the same: 10 async tasks laying down bytes over a connection pool, going back to read more bytes from the source as each network request completes.
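A rough sketch of that idea; insertChunkAsync is a hypothetical stand-in for whatever actually writes one chunk through the connection pool, not the driver's real API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class ParallelUploadSketch
{
    public static async Task UploadAsync(Stream source, Func<byte[], Task> insertChunkAsync,
        int chunkSize = 255 * 1024, int maxInFlight = 10)
    {
        var gate = new SemaphoreSlim(maxInFlight);
        var pending = new List<Task>();
        var buffer = new byte[chunkSize];
        int read;
        while( (read = await source.ReadAsync(buffer, 0, chunkSize)) > 0 )
        {
            var chunk = new byte[read];
            Array.Copy(buffer, chunk, read);

            // Stop reading once maxInFlight inserts are pending; resume
            // as each one completes -- same behavior as the node client.
            await gate.WaitAsync();
            pending.Add(Task.Run(async () =>
            {
                try { await insertChunkAsync(chunk); }
                finally { gate.Release(); }
            }));
        }
        await Task.WhenAll(pending);
    }
}
```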
Other Research Findings
RethinkDB Limitations
RethinkDB rejects over-sized queries:

`Query size (419554663) greater than maximum (134217727).`

So the batch size can't be too big: the maximum query size is 134217727 bytes (~128 MiB), which caps each batch at roughly 128 MB of chunk data.
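Quick arithmetic on what that cap means per batch, assuming a 255 KiB chunk size (a GridFS-style default; the driver's actual default may differ):

```csharp
using System;

const long MaxQuerySize = 134217727;  // from the error above: 2^27 - 1, ~128 MiB
const int ChunkSize = 255 * 1024;     // assumed chunk size in bytes

long maxChunksPerBatch = MaxQuerySize / ChunkSize;
Console.WriteLine(maxChunksPerBatch); // 514 -- a bigger batch would exceed the query cap
```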