

Scalebox - A cloud-native streaming computing engine

Scalebox is a cloud-native streaming computing engine that runs containerized, stand-alone user algorithms on distributed and heterogeneous computing clusters, organizes large-scale parallel processing at the module level with pipelines, and supports task-level fault tolerance. Compared with existing big data processing and parallel computing frameworks, it is especially suited to application scenarios in which data and computing resources are distributed and the algorithms are complex.

Scalebox has the following important properties:

  • Cloud-native design: All algorithm modules and transmission modules are encapsulated in containers and embedded, through the sidecar pattern, into a data processing pipeline designed for the cloud environment; the underlying software platform is fully cloud-native. The platform separates control messages from the data channel, and upstream and downstream modules are connected by a message bus. This enables non-intrusive parallel programming and greatly reduces the difficulty of parallel computing.

  • Cross-cluster computing: Algorithm modules and transmission modules are normalized, and intra-cluster and cross-cluster data are handled uniformly through the pipeline, which hides cross-cluster differences in data and computing. Cross-cluster, message-driven stream processing allows a single pipeline application to be deployed on multiple heterogeneous computing clusters.

  • Task-level fault tolerance: Sporadic errors caused by hardware failures, software bugs, network problems, data anomalies, and similar issues are handled automatically by rule-based fault tolerance. Fine-grained, task-level fault tolerance enables trustworthy data analysis on unreliable hardware.

  • Location-aware scheduling: The IP address of the sender can be set in the message body, which supports local cascaded processing between upstream and downstream modules. The processing layer and the storage layer are separated to reduce coupling: messages flowing horizontally (within the processing layer) drive data reads and writes vertically (between the processing layer and the storage layer). This reduces east-west network traffic in the computing cluster and eliminates the I/O bottleneck of cluster storage. As a result, computation can run locally without shared storage, large files can be loaded purely locally, and horizontal scaling is supported effectively.

  • Task perspective: Each computing task is a message-driven process, and the task perspective records the detailed running status of every task. This includes the user program's return code, standard output, standard error, program-defined text, and the number of bytes read and written by the user program. It also includes system-level and user-defined timestamps recorded on both the computing-container side and the control side during the task execution cycle (message generation, distribution, and processing, and result recording). The task perspective provides the basis for pinpointing problems during troubleshooting, for application optimization, and for data statistics.

  • Multiple parallelization methods

    • Algorithm parallelism within the module
    • Module-level data parallelism
    • Cross-module pipeline parallelism
  • Multiple computing backends

    • Multiple types of computing clusters (self-managed clusters, HPC clusters, k8s container clusters, etc.)
    • Various container engines
      • docker: the default container engine
      • singularity
      • k8s: TODO
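
As a rough illustration of the message-driven pipeline, rule-based task-level fault tolerance, and per-task records described above, here is a minimal Python sketch. It is not Scalebox's implementation or API: the names `TaskRecord`, `run_task`, and `run_pipeline`, the return code 75, and the retry limit are all assumptions made for illustration, and plain Python callables stand in for containerized modules.

```python
import time
from collections import deque
from dataclasses import dataclass, field

# Hypothetical rule: return codes treated as transient errors worth retrying.
RETRY_CODES = {75}
# Hypothetical limit on attempts per task.
MAX_ATTEMPTS = 3


@dataclass
class TaskRecord:
    """Per-task record, loosely modeled on the 'task perspective' idea."""
    module: str
    message: str
    return_code: int = -1
    attempts: int = 0
    timestamps: dict = field(default_factory=dict)


def run_task(module, handler, message):
    """Run one message through one module, retrying on transient codes.

    `handler` stands in for a containerized user program. It takes a message
    and returns (return_code, list_of_output_messages).
    """
    record = TaskRecord(module=module, message=message)
    record.timestamps["dispatched"] = time.time()
    outputs = []
    while record.attempts < MAX_ATTEMPTS:
        record.attempts += 1
        record.return_code, outputs = handler(message)
        if record.return_code not in RETRY_CODES:
            break
    record.timestamps["finished"] = time.time()
    return record, outputs


def run_pipeline(modules, initial_messages):
    """Drive an ordered list of (name, handler) modules with messages.

    Messages emitted by a successful task become tasks for the next module.
    """
    records = []
    queue = deque((0, m) for m in initial_messages)
    while queue:
        index, message = queue.popleft()
        name, handler = modules[index]
        record, outputs = run_task(name, handler, message)
        records.append(record)
        if record.return_code == 0 and index + 1 < len(modules):
            queue.extend((index + 1, out) for out in outputs)
    return records


if __name__ == "__main__":
    # Two toy modules: "copy" emits one message per input, "process" emits none.
    def copy(message):
        return 0, [message + ".local"]

    def process(message):
        return 0, []

    for rec in run_pipeline([("copy", copy), ("process", process)], ["a.dat"]):
        print(rec.module, rec.message, rec.return_code, rec.attempts)
```

In this sketch a failed task affects only its own message: other messages in the queue continue through the pipeline, which is the intuition behind task-level (rather than job-level) fault tolerance.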

This repository contains:

  1. The Scalebox server environment, based on docker-compose (service environment)
  2. Dockerfile definitions for the Scalebox standard modules (standard modules)
  3. Application examples for Scalebox (application examples)
  4. Tests of the main Scalebox features (feature tests)

Table of Contents

Background

Install

Usage

Examples

Feature tests

Maintainers

@Kaichao.

Contributing

Feel free to dive in! Open an issue or submit Pull Requests.

License

Apache © Kaichao Wu