


Scalebox - A cloud-native streaming computing engine

Scalebox is a cloud-native streaming computing engine. It runs containerized, stand-alone user algorithms on distributed, heterogeneous computing clusters, organizes large-scale parallel processing at the module level through pipelines, and supports task-level fault tolerance. Compared with existing big data processing and parallel computing frameworks, it is especially suited to scenarios with distributed data, distributed computing resources, and complex algorithms.

Scalebox has the following important properties:

Cloud-native design: All algorithm modules and transmission modules are encapsulated in containers and embedded, through the sidecar pattern, into a data processing pipeline designed for the cloud environment; the underlying software platform is entirely cloud-native. The platform separates control messages from the data channel and connects upstream and downstream modules with a message bus, which enables non-intrusive parallel programming and greatly reduces the difficulty of parallel computing.
Cross-cluster computing: Algorithm modules and transmission modules are normalized, and intra-cluster and cross-cluster data are handled uniformly through the pipeline, hiding cross-cluster differences in data and computing. Cross-cluster, message-driven stream processing allows a single pipeline application to be deployed on multiple heterogeneous computing clusters.
Task-level fault tolerance: Sporadic errors caused by hardware failures, software bugs, network problems, data anomalies, and the like are handled automatically by rule-based fault tolerance. Fine-grained, task-level fault tolerance enables trustworthy data analysis on unreliable hardware.
Location-aware scheduling: The sender's IP address can be set in the message body, which supports local cascaded processing between upstream and downstream modules. The processing layer and the storage layer are separated to reduce coupling: messages flowing horizontally (within the processing layer) drive vertical data reads and writes (between the processing layer and the storage layer). This reduces east-west network traffic in the computing cluster and removes the I/O bottleneck of cluster storage. It in turn enables local computing without shared storage and purely local loading of large files, and effectively supports horizontal scaling.
Task perspective: Each computing task is a message-driven process, and the task perspective records its running status in detail. The record includes the user program's return code, standard output, standard error, program-defined text, and the number of bytes read and written by the user program, as well as system-level and user-defined timestamps taken on both the computing-container side and the control side during the task's life cycle (message generation, distribution, processing, and result recording). The task perspective supports precise fault location during troubleshooting, application optimization, and data statistics.
Multiple parallelization methods
- Algorithm parallelism within the module
- Module-level data parallelism
- Cross-module pipeline parallelism
Multiple computing backends
- Multiple types of computing clusters (self-managed clusters, HPC clusters, k8s container clusters, etc.)
- Various container engines
  - docker: the default container engine
  - singularity
  - k8s: TODO
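
The rule-based, task-level fault tolerance described above can be pictured with a small Go sketch. This is only an illustration under stated assumptions, not Scalebox's actual implementation or API: the names `Rule`, `Action`, `decide`, `runTask`, and `flaky` are hypothetical, and the return code 75 is an arbitrary stand-in for a "sporadic error" code. The idea is that each task attempt yields a user-program return code, a rule table maps that code to an action, and only the failed task is retried rather than the whole pipeline.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is what the (hypothetical) rule engine does after one task attempt.
type Action int

const (
	Done  Action = iota // the task succeeded
	Retry               // sporadic error: run the same task again
	Fail                // permanent error: give up on this task
)

// Rule maps a user-program return code to an action. Illustrative only.
type Rule struct {
	ReturnCode int
	Action     Action
}

// decide picks the action for a return code. Code 0 means success;
// codes that match no rule are treated as permanent failures.
func decide(rules []Rule, code int) Action {
	if code == 0 {
		return Done
	}
	for _, r := range rules {
		if r.ReturnCode == code {
			return r.Action
		}
	}
	return Fail
}

// runTask calls run until it succeeds, fails permanently, or maxAttempts
// is exhausted. It returns the number of attempts made and an error if
// the task did not succeed.
func runTask(run func(attempt int) int, rules []Rule, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		code := run(attempt)
		switch decide(rules, code) {
		case Done:
			return attempt, nil
		case Fail:
			return attempt, fmt.Errorf("task failed with return code %d", code)
		}
		// Retry: continue with the next attempt.
	}
	return maxAttempts, errors.New("task exceeded the retry limit")
}

// flaky simulates a user program that hits a sporadic error (code 75)
// on its first two attempts and succeeds on the third.
func flaky(attempt int) int {
	if attempt < 3 {
		return 75
	}
	return 0
}

func main() {
	rules := []Rule{{ReturnCode: 75, Action: Retry}}
	attempts, err := runTask(flaky, rules, 5)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Keeping the decision in a rule table rather than in the user program matches the non-intrusive approach described above: the containerized algorithm only reports a return code, and the retry policy lives outside it.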

This repository contains:

A Scalebox server environment based on docker-compose (service environment)
Dockerfile definitions for Scalebox standard modules (standard modules)
Scalebox application examples (application examples)
Tests of the main features of Scalebox (feature tests)

Table of Contents

- Background
- Install
- Usage
- Examples
- Feature tests
- Related Software
- Maintainers
- Contributing
- License

Related Software

PostgreSQL Database Management System: the Scalebox backend database
gRPC, an RPC library and framework: efficient communication protocol between software modules
The Go Programming Language: programming language for cloud-native applications
Pony ORM ER Diagram Editor: entity-relationship diagram tool
