APIs
This page covers the following topic:
- Creating or modifying the source code of a benchmark to execute on LegoSim.
Create Source Code for Benchmarks
Given the complexity of the software stacks of heterogeneous systems, a standard software stack cannot be expected for every experimental platform on LegoSim. Hence, task partitioning and task management must be done manually. Each simulator needs its own executable file; for example, SniperSim and GPGPUSim require different executable files. However, imported simulators can share the same executable file if they perform the same task on different datasets, in a Same-Task-Multiple-Data (STMD) fashion.
Take the matmul benchmark as an example. In matmul, the CPU generates the source matrices and sends the data to the GPGPUs, while three GPGPUs perform the matrix multiplication, each handling a part of the data. The matmul benchmark therefore provides two executable files, one for SniperSim and one for GPGPUSim; the GPGPUSim simulation processes share the same executable file, as sketched below.
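As a rough illustration of the STMD pattern, a single GPGPU executable might select its share of the work from a per-process rank argument. This is a hypothetical sketch, not code from the matmul benchmark: the rank argument, the matrix size `N`, and the partitioning scheme are all assumptions.

```cpp
// Hypothetical STMD sketch: three simulation processes run this same
// executable, each picking its own slice of the output rows by rank.
#include <cstdlib>

const int N = 384;  // assumed total number of rows in the source matrix

int main(int argc, char** argv) {
    int rank = std::atoi(argv[1]);   // 0, 1, or 2; assumed per-process argument
    int rows_per_gpu = N / 3;        // each GPGPU handles one third of the rows
    int row_begin = rank * rows_per_gpu;
    int row_end = row_begin + rows_per_gpu;
    // ... receive rows [row_begin, row_end) from the CPU, multiply,
    //     and send the partial result back ...
    return 0;
}
```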
Several APIs must be added to benchmarks so that processes can communicate and synchronize. In a real system, similar APIs are provided by a complete software stack (such as CUDA).
API Lists
TODO: non-blocking APIs.
Communication
The source sends data to the destination with `sendMessage`, and the destination receives data from the source with `receiveMessage`. The source and destination addresses must be the same in the matching `sendMessage` and `receiveMessage` calls.
APIs for CPU
```cpp
syscall_return_t sendMessage(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y, void* __addr, int64_t __nbyte);
syscall_return_t receiveMessage(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y, void* __addr, int64_t __nbyte);
```
APIs for CUDA
```cpp
cudaError_t sendMessage(int __dst_x, int __dst_y, int __src_x, int __src_y, void* __addr, int __nbyte);
cudaError_t receiveMessage(int __dst_x, int __dst_y, int __src_x, int __src_y, void* __addr, int __nbyte);
```
Arguments
- `__dst_x` and `__dst_y` specify the destination address.
- `__src_x` and `__src_y` specify the source address.
- `__addr` specifies the pointer to the data array.
- `__nbyte` specifies the number of bytes in the data array.
Return value
- APIs for CPU return the result of the operation:
  - 0 means the transmission operation succeeds.
  - 1 means the transmission operation fails.
- APIs for GPU return a `cudaError_t` value.
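A minimal sketch of a matching send/receive pair between a process at (0, 1) and a process at (0, 0); the buffer names, payload size, and function wrappers are placeholders:

```cpp
#include "apis_c.h"  // CPU API declarations (sendMessage / receiveMessage)

// Process on the sender chiplet (0, 1): push 1024 bytes to (0, 0).
void sender() {
    char buf[1024] = {0};                // placeholder payload
    // Argument order is (dst_x, dst_y, src_x, src_y, addr, nbyte).
    sendMessage(0, 0, 0, 1, buf, 1024);
}

// Process on the receiver chiplet (0, 0): the matching call uses the
// same (dst, src) address pair as the send.
void receiver() {
    char buf[1024];
    receiveMessage(0, 0, 0, 1, buf, 1024);
}
```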
Lock and unlock
`lock` and `unlock` are used to manage a critical region. `lock` blocks the calling process until the mutex is acquired by this request; `unlock` releases the mutex.
APIs for CPU
```cpp
syscall_return_t lock(int64_t __uid, int64_t __src_x, int64_t __src_y);
syscall_return_t unlock(int64_t __uid, int64_t __src_x, int64_t __src_y);
```
APIs for CUDA
```cpp
cudaError_t lock(int __uid, int __src_x, int __src_y);
cudaError_t unlock(int __uid, int __src_x, int __src_y);
```
Arguments
- `__uid` specifies a unique ID for the mutex. `__uid` must not be the same as any address in the system.
- `__src_x` and `__src_y` specify the source address.
Return value
- APIs for CPU return the result of the operation:
  - 0 means the operation succeeds.
  - 1 means the operation fails.
- APIs for GPU return a `cudaError_t` value.
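For illustration, a process at (0, 1) might guard a critical region as follows; the mutex ID 1000 is a placeholder chosen not to collide with any address in the system:

```cpp
#include "apis_c.h"  // CPU API declarations (lock / unlock)

// Process at (0, 1) acquires mutex 1000 (placeholder ID), enters the
// critical region, then releases the mutex for the next process.
void update_shared_state() {
    lock(1000, 0, 1);
    // ... critical region: only one process executes this at a time ...
    unlock(1000, 0, 1);
}
```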
Barrier
`barrier` blocks the calling process until a certain number of processes have entered the barrier.
APIs for CPU
```cpp
syscall_return_t barrier(int64_t __uid, int64_t __src_x, int64_t __src_y, int64_t __count = 0);
```
APIs for CUDA
```cpp
cudaError_t barrier(int __uid, int __src_x, int __src_y, int __count = 0);
```
Arguments
- `__uid` specifies a unique ID for the barrier. `__uid` must not be the same as any address in the system.
- `__src_x` and `__src_y` specify the source address.
- `__count` specifies the number of processes the barrier waits for. If `__count` is greater than 0, it overrides the number of processes registered for the barrier.
Return value
- APIs for CPU return the result of the operation:
  - 0 means the operation succeeds.
  - 1 means the operation fails.
- APIs for GPU return a `cudaError_t` value.
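As a sketch, four processes could synchronize on a shared barrier; the barrier ID 2000, the count of four, and the wrapper function are placeholder assumptions:

```cpp
#include "apis_c.h"  // CPU API declaration (barrier)

// Each of the four participating processes calls this with its own
// source address; every call blocks until all four have entered.
void sync_point() {
    barrier(2000, 0, 1, 4);  // barrier ID 2000 (placeholder), process at (0, 1)
    // From here on, all four processes have passed the barrier.
}
```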
Launch
When several masters share a computation resource, the masters must send launch requests to the shared computation resource to start tasks.
The program on a master calls `launch` to start one task on the shared computation resource. `launch` blocks the program on the master until the shared computation resource has been triggered by this request.
The program on the shared computation resource calls `waitLaunch` to identify the launcher. `waitLaunch` blocks the task until one master triggers it.
The source and destination addresses must be the same in the matching `launch` and `waitLaunch` calls.
APIs for CPU
```cpp
syscall_return_t launch(int64_t __dst_x, int64_t __dst_y, int64_t __src_x, int64_t __src_y);
syscall_return_t waitLaunch(int64_t __dst_x, int64_t __dst_y, int64_t* __src_x, int64_t* __src_y);
```
APIs for CUDA
```cpp
cudaError_t launch(int __dst_x, int __dst_y, int __src_x, int __src_y);
cudaError_t waitLaunch(int __dst_x, int __dst_y, int* __src_x, int* __src_y);
```
Arguments
- `__dst_x` and `__dst_y` specify the destination address.
- `__src_x` and `__src_y` specify the source address.
Return value
- `waitLaunch` returns the source of the launch command through `__src_x` and `__src_y`.
- APIs for CPU return the result of the operation:
  - 0 means the operation succeeds.
  - 1 means the operation fails.
- APIs for GPU return a `cudaError_t` value.
TODO: A more flexible way to specify the source and the destination address.
Example
```cpp
//
// master (0, 1)
//
...
// Launch the task on the slave at (0, 0).
launch(0, 0, 0, 1);
// Send data to the slave.
sendMessage(0, 0, 0, 1, src_data, 1024);
// Wait for and receive the result from the slave.
receiveMessage(0, 1, 0, 0, dst_data, 8);
...

//
// slave (0, 0)
//
...
// Wait for the launcher.
int64_t src_x = -1, src_y = -1;
waitLaunch(0, 0, &src_x, &src_y);
// Receive data from the master.
receiveMessage(0, 0, 0, 1, src_data, 1024);
// Run the task.
...
// Send the result to the master.
sendMessage(0, 1, 0, 0, dst_data, 8);
...
```
API Declaration and Implementation
APIs for CPU
The declaration of the APIs for CPUs is provided in `$SIMULATOR_ROOT/interchiplet/includes/apis_c.h`. The implementation of these APIs is compiled into a static library, `$SIMULATOR_ROOT/interchiplet/lib/libinterchiplet_c.a`, which should be linked into the benchmark.
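For example, a link step for a CPU benchmark might look like the following; the target and variable names (`CPU_target`, `CPU_OBJS`, `CPU_TARGET`) are hypothetical, not taken from the LegoSim build files:

```makefile
# C/C++ language target (variable names are hypothetical)
CPU_target: $(CPU_OBJS)
	$(CXX) $(CPU_OBJS) -L$(SIMULATOR_ROOT)/interchiplet/lib -linterchiplet_c -o $(CPU_TARGET)
```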
APIs on CPUs are implemented as system calls, which can carry a variable number of arguments. The mapping between APIs and system call IDs is listed below:
| API | System Call ID |
|---|---|
| `launch` | `SYSCALL_LAUNCH` |
| `waitLaunch` | `SYSCALL_WAITLAUNCH` |
| `lock` | `SYSCALL_LOCK` |
| `unlock` | `SYSCALL_UNLOCK` |
| `barrier` | `SYSCALL_BARRIER` |
| `sendMessage` | `SYSCALL_REMOTE_WRITE` |
| `receiveMessage` | `SYSCALL_REMOTE_READ` |
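To illustrate the mechanism, a CPU API can be thought of as a thin wrapper that issues the corresponding system call, which the simulator then intercepts. The sketch below is an assumption for illustration only: the syscall number 504 and the wrapper body are hypothetical, and the real implementation in `libinterchiplet_c.a` may differ.

```cpp
// Hypothetical sketch of a CPU API as a raw system-call wrapper.
#include <cstdint>
#include <unistd.h>
#include <sys/syscall.h>

#define SYSCALL_REMOTE_WRITE 504  // assumed ID; the real value is simulator-defined

typedef long syscall_return_t;    // assumed; the real typedef lives in apis_c.h

syscall_return_t sendMessage(int64_t __dst_x, int64_t __dst_y,
                             int64_t __src_x, int64_t __src_y,
                             void* __addr, int64_t __nbyte) {
    // The simulator intercepts this syscall ID and performs the transfer;
    // per the table above, it would return 0 on success and 1 on failure.
    return syscall(SYSCALL_REMOTE_WRITE,
                   __dst_x, __dst_y, __src_x, __src_y, __addr, __nbyte);
}
```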
APIs for CUDA
APIs on CUDA platforms are implemented on top of built-in CUDA APIs. The declaration of these APIs is provided in `$SIMULATOR_ROOT/interchiplet/includes/apis_cu.h`. The implementation is provided by the CUDA simulator, such as GPGPU-Sim. Hence, when compiling executable files for CUDA platforms, the corresponding CUDA library should be linked as below:
```makefile
# CUDA language target
CUDA_target: $(CUDA_OBJS)
	$(NVCC) -L$(SIMULATOR_ROOT)/gpgpu-sim/lib/$(GPGPUSIM_CONFIG) --cudart shared $(CUDA_OBJS) -o $(CUDA_TARGET)
```