"Speech recognition",
"multi-threading with Kinect",
and "making NtKinect's program as DLL"
are explained in
"NtKinect: How to recognize speech with Kinect V2".,
"NtKinect: How to run Kinect V2 in a multi-thread environment",
"NtKinect: How to make Kinect V2 program as DLL and use it from Unity",
respectively.
In this article, we will explain how to create a DLL file for speech recognition.
The project should have the following file structure. There may be folders and files other than those listed, but ignore them for now.
```
NtKinectDll3/
  NtKinectDll.sln
  x64/Release/
  NtKinectDll/
    dllmain.cpp
    NtKinect.h
    NtKinectDll.h
    NtKinectDll.cpp
    stdafx.cpp
    stdafx.h
    targetver.h
```
Copy KinectAudioStream.cpp, KinectAudioStream.h, and WaveFile.h to the folder where the project source files are located (NtKinectDll3/NtKinectDll/ in this example).
[Notice] KinectAudioStream.cpp is slightly modified so that it includes stdafx.h (a sketch of this change follows the listing below). The contents of the folder are now as follows.
```
NtKinectDll3/
  NtKinectDll.sln
  x64/Release/
  NtKinectDll/
    dllmain.cpp
    NtKinect.h
    NtKinectDll.h
    NtKinectDll.cpp
    stdafx.cpp
    stdafx.h
    targetver.h
    KinectAudioStream.cpp
    KinectAudioStream.h
    WaveFile.h
```
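The change to KinectAudioStream.cpp is minimal. The sketch below assumes the only modification is the added include at the top of the file; everything else is left as distributed.

```cpp
// KinectAudioStream.cpp (top of file) -- sketch of the modification.
#include "stdafx.h"   // added so the file builds against this project's precompiled header
// ... the original includes and the rest of KinectAudioStream.cpp remain unchanged ...
```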
"Solution Explorer" -> Right-click over the "Source Files" -> "Add" -> "Add existing item" -> select "KinectAudioStream.cpp"
"Solution Explorer"-> Right-click over the "Header File" -> "Add" -> "Add existing item" -> select "KinectAudioStream.h"
"Solution Explorer" -> Right-click over "header file" -> 「追加」 -> 「既存の項目の追加」 -> WaveFile.h を選択する
The three files have now been added to the project. (NtKinect.h has been part of the project from the beginning.)
"Configuration Properties" -> "VC++ directory" -> "Include directory" -> Add
$(ProgramW6432)\Microsoft SDKs\Speech\V11.0\Includeto the first place.
"Configuration Properties" -> "VC++ directory" -> "Library directory" -> Add
$(ProgramW6432)\Microsoft SDKs\Speech\V11.0\Libto the first place.
"Configuration Properties" -> "Linker" -> "General" -> "Input" -> Add.
sapi.lib
The following two libraries should already be set up, but please confirm.
Kinect20.lib
opencv_world310.lib
The NTKINECTDLL_API macro (the part shown in green letters in the original listing) handles DLL import/export and has been defined since the project was created. When NtKinectDll.h is included within this project it produces export declarations, and when it is included from other projects it produces import declarations.
The function prototypes (the part shown in blue letters) are the functions we define ourselves. The C++ compiler normally changes a function name into one that also encodes the return and argument types; this is called name mangling. To avoid mangling of the C++ function names, declare the prototypes inside extern "C" { }. This makes it possible to use the DLL from other languages.
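As an illustration of what extern "C" prevents, compare the exported symbol names below. These declarations are hypothetical (not part of NtKinectDll.h), and the mangled spelling is only an example, since it depends on the compiler; exported names can be inspected with a tool such as dumpbin /exports.

```cpp
// Hypothetical declarations for illustration only.
__declspec(dllexport) void* getKinectMangled(void);
// -> exported under a mangled name such as "?getKinectMangled@@YAPEAXXZ",
//    which encodes the return type and argument types.

extern "C" __declspec(dllexport) void* getKinectPlain(void);
// -> exported as plain "getKinectPlain", so C# [DllImport] (or C, etc.) can find it by name.
```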
To avoid name conflicts, we define a namespace NtKinectSpeech and declare the function prototypes and variables inside it.
Speech recognition always runs in a separate thread, and the recognized results are stored in speechQueue. A mutex is used for exclusive access whenever the queue is touched.
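The queue access pattern used in NtKinectDll.cpp (shown later) boils down to the following. This is a minimal, self-contained sketch with renamed variables, not the DLL code itself.

```cpp
#include <mutex>
#include <list>
#include <string>
#include <utility>

std::mutex mtx;                                           // plays the role of NtKinectSpeech::mutex
std::list<std::pair<std::wstring,std::wstring>> queue_;   // plays the role of speechQueue

// Producer side: the recognition thread pushes one (tag, item) pair.
void push_result(const std::pair<std::wstring,std::wstring>& p) {
  mtx.lock();
  queue_.push_back(p);
  mtx.unlock();
}

// Consumer side: the caller pops one entry under the same lock.
bool pop_result(std::pair<std::wstring,std::wstring>& p) {
  mtx.lock();
  bool empty = queue_.empty();
  if (!empty) { p = queue_.front(); queue_.pop_front(); }
  mtx.unlock();
  return !empty;
}
```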
NtKinectDll.h
```cpp
#ifdef NTKINECTDLL_EXPORTS
#define NTKINECTDLL_API __declspec(dllexport)
#else
#define NTKINECTDLL_API __declspec(dllimport)
#endif

#include <mutex>
#include <list>
#include <thread>

namespace NtKinectSpeech {
  extern "C" {
    NTKINECTDLL_API void* getKinect(void);
    NTKINECTDLL_API void initSpeech(void* kinect);
    NTKINECTDLL_API void setSpeechLang(void* kinect, wchar_t*, wchar_t*);
    NTKINECTDLL_API int speechQueueSize(void* kinect);
    NTKINECTDLL_API int getSpeech(void* kinect, wchar_t*& tagPtr, wchar_t*& itemPtr);
    NTKINECTDLL_API void destroySpeech(void* kinect);
  }
  std::mutex mutex;
  std::thread* speechThread;
  std::list<std::pair<std::wstring,std::wstring>> speechQueue;
  bool speechActive;

#define SPEECH_MAX_LENGTH 1024
  wchar_t tagBuffer[SPEECH_MAX_LENGTH];
  wchar_t itemBuffer[SPEECH_MAX_LENGTH];
}
```
"NTKINECTDLL_API" must be written at the beginning of the function declaration. This is a macro defined in NtKinectDll.h to facilitate export/import from DLL.
In the DLL, the object must be allocated on the heap. For this reason, the void* getKinect() function allocates an NtKinect instance on the heap and returns a pointer to it cast to void*.
When a DLL function is called, the pointer to the NtKinect object is passed as an argument of type void*. We cast it back to NtKinect* and call NtKinect's member functions through it; for example, the member function acquire() is invoked as (*kinect).acquire().
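Condensed from the NtKinectDll.cpp listing below: getKinect() is taken verbatim, while someFunction() is a hypothetical entry point added only to illustrate the cast-back pattern; the excerpt assumes the same includes as the full file.

```cpp
// getKinect(): allocate the NtKinect object on the heap and return it as an opaque pointer.
NTKINECTDLL_API void* getKinect(void) {
  NtKinect* kinect = new NtKinect;
  return static_cast<void*>(kinect);
}

// someFunction() is hypothetical: every other exported function casts the void* back like this.
NTKINECTDLL_API void someFunction(void* ptr) {
  NtKinect* kinect = static_cast<NtKinect*>(ptr);
  (*kinect).acquire();     // member functions are reached through the cast pointer
  // ... call other NtKinect members here ...
  (*kinect).release();
}
```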
On the Unity (C#) side, data is managed and may be moved by the garbage collector, so be careful when exchanging data between C# and C++.
The first argument of NtKinect's setSpeechLang() function is of string type, and the second argument is of wstring type. When a Unity (C#) string is passed to a DLL (C++) function, it arrives as wide characters (UTF-16) and can be used directly as a wstring (UTF-16) on the DLL (C++) side. When string data is needed, it must be converted from wstring (UTF-16) to string (UTF-8) in C++ with the WideCharToMultiByte() function. In the definition of void setSpeechLang(void*, wchar_t*, wchar_t*) in NtKinectDll.cpp, the first WideCharToMultiByte() call finds the number of bytes of the converted characters, and the second call converts the characters from UTF-16 to UTF-8 and writes them into langBuffer.
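Isolating just that conversion, the pattern looks like the sketch below. The helper name utf16_to_utf8() is hypothetical; the DLL performs these steps inline inside setSpeechLang().

```cpp
#include <windows.h>
#include <cstring>
#include <string>

// Minimal sketch of the two-call WideCharToMultiByte() pattern used in setSpeechLang().
std::string utf16_to_utf8(const wchar_t* wtext) {
  // 1st call: ask for the required buffer size in bytes.
  int len = WideCharToMultiByte(CP_UTF8, 0, wtext, -1, NULL, 0, NULL, NULL) + 1;
  char* buffer = new char[len];
  memset(buffer, '\0', len);
  // 2nd call: perform the actual UTF-16 -> UTF-8 conversion into the buffer.
  WideCharToMultiByte(CP_UTF8, 0, wtext, -1, buffer, len, NULL, NULL);
  std::string result(buffer);
  delete[] buffer;
  return result;
}
```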
When passing C++ wstring data to Unity (C#), you need to place the wchar_t data in memory on the C++ side that remains valid after the function returns (here, the statically allocated tagBuffer and itemBuffer) and pass its address. The getSpeech() function returns two wstrings, the tag and the item of the recognized word: C# passes two pointer references as arguments, and the C++ side writes the addresses of the wchar_t buffers into those references.
NtKinectDll.cpp
```cpp
#include "stdafx.h"
#include "NtKinectDll.h"

#define USE_THREAD
#define USE_SPEECH
#include "NtKinect.h"

using namespace std;

namespace NtKinectSpeech {
  NTKINECTDLL_API void* getKinect(void) {
    NtKinect* kinect = new NtKinect;
    return static_cast<void*>(kinect);
  }
  NTKINECTDLL_API void setSpeechLang(void* ptr, wchar_t* wlang, wchar_t* grxmlBuffer) {
    NtKinect *kinect = static_cast<NtKinect*>(ptr);
    if (wlang && grxmlBuffer) {
      // convert the UTF-16 language name to a UTF-8 string
      int len = WideCharToMultiByte(CP_UTF8,NULL,wlang,-1,NULL,0,NULL,NULL) + 1;
      char* langBuffer = new char[len];
      memset(langBuffer,'\0',len);
      WideCharToMultiByte(CP_UTF8,NULL,wlang,-1,langBuffer,len,NULL,NULL);
      string lang(langBuffer);
      delete[] langBuffer;   // free the temporary buffer to avoid a leak
      wstring grxml(grxmlBuffer);
      (*kinect).acquire();
      (*kinect).setSpeechLang(lang,grxml);
      (*kinect).release();
    }
  }
  // worker thread: keeps pushing recognition results into speechQueue
  void speechThreadFunc(NtKinect* kinect) {
    ERROR_CHECK(CoInitializeEx(NULL,COINIT_MULTITHREADED));
    (*kinect).acquire();
    (*kinect).startSpeech();
    (*kinect).release();
    while (speechActive) {
      pair<wstring,wstring> p;
      bool flag = (*kinect)._setSpeech(p);
      if (flag) {
        mutex.lock();
        speechQueue.push_back(p);
        mutex.unlock();
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  }
  NTKINECTDLL_API void initSpeech(void* ptr) {
    NtKinect *kinect = static_cast<NtKinect*>(ptr);
    speechActive = true;
    speechThread = new std::thread(NtKinectSpeech::speechThreadFunc, kinect);
    return;
  }
  NTKINECTDLL_API int speechQueueSize(void* ptr) {
    int n=0;
    mutex.lock();
    n = (int) speechQueue.size();
    mutex.unlock();
    return n;
  }
  NTKINECTDLL_API int getSpeech(void* ptr, wchar_t*& tagPtr, wchar_t*& itemPtr) {
    NtKinect *kinect = static_cast<NtKinect*>(ptr);
    wmemset(tagBuffer,'\0',SPEECH_MAX_LENGTH);
    wmemset(itemBuffer,'\0',SPEECH_MAX_LENGTH);
    pair<wstring,wstring> p;
    mutex.lock();
    bool empty = speechQueue.empty();
    if (! empty) {
      p = speechQueue.front();
      speechQueue.pop_front();
    }
    mutex.unlock();
    if (!empty) {
      // use c_str(): passing a std::wstring object itself through wsprintf's varargs is undefined
      wsprintf(tagBuffer,L"%ls",p.first.c_str());
      wsprintf(itemBuffer,L"%ls",p.second.c_str());
    }
    tagPtr = tagBuffer;
    itemPtr = itemBuffer;
    return !empty;
  }
  NTKINECTDLL_API void destroySpeech(void* ptr) {
    speechActive = false;
    speechThread->join();
    NtKinect *kinect = static_cast<NtKinect*>(ptr);
    (*kinect).acquire();
    (*kinect).stopSpeech();
    (*kinect).release();
    delete speechThread;
    CoUninitialize();
  }
}
```
[Caution] (added Oct 7, 2017) If you encounter a "dllimport ..." error when building with Visual Studio 2017 Update 2, please refer to here and define NTKINECTDLL_EXPORTS in NtKinectDll.cpp.
Since the above zip file may not include the latest NtKinect.h, download the latest version from here and replace the old one with it.
Let's write a Unity program using the above DLL file to recognize speech with Kinect V2.
sapi.lib is used for speech recognition and is usually located in the folder "C:\Program Files\Microsoft SDKs\Speech\v11.0\Lib". It seems that the library only needs to be installed in the Windows 10 environment, and there is no need to put a copy under the individual Unity project.
From the menu at the top, "Game Object"-> "3D Object" -> "Cube"
From the menu at the top, "Assets" -> "Create" -> "C# Script" -> Filename is CubeBehaviour
C++ pointers are treated as System.IntPtr in C#.
Unity (C#) strings are managed, while the string and wstring of the DLL (C++) are unmanaged. When passing a Unity (C#) string as an argument to a DLL (C++) function, convert it to an unmanaged UTF-16 string on the heap using the Marshal.StringToHGlobalUni() function. This removes the danger of the data being moved by the C# garbage collector. When the unmanaged data is no longer needed, you must call the Marshal.FreeHGlobal() function to free the memory.
To return the two recognized strings (UTF-16) from the C++ side to the C# side, pointers declared with ref are passed as arguments. Since the memory returned is allocated on the C++ side, the C# side immediately converts it to a managed string with the Marshal.PtrToStringUni() function.
The character encoding of a Unity string (System.String) is UTF-16.
CubeBehaviour.cs
```csharp
using UnityEngine;
using System.Collections;
using System;
using System.Runtime.InteropServices;

public class CubeBehaviour : MonoBehaviour {
  [DllImport ("NtKinectDll")] private static extern IntPtr getKinect();
  [DllImport ("NtKinectDll")] private static extern void initSpeech(IntPtr kinect);
  [DllImport ("NtKinectDll")] private static extern void setSpeechLang(IntPtr kinect,IntPtr lang,IntPtr grxml);
  [DllImport ("NtKinectDll")] private static extern int getSpeech(IntPtr kinect,ref IntPtr tagPtr,ref IntPtr itemPtr);
  [DllImport ("NtKinectDll")] private static extern void destroySpeech(IntPtr kinect);

  private IntPtr kinect;

  void Start () {
    kinect = getKinect();
    IntPtr lang = Marshal.StringToHGlobalUni("ja-JP");              // "en-US"
    IntPtr grxml = Marshal.StringToHGlobalUni("Grammar_jaJP.grxml"); // "Grammar_enUS.grxml"
    setSpeechLang(kinect,lang,grxml);
    initSpeech(kinect);
    Marshal.FreeHGlobal(lang);
    Marshal.FreeHGlobal(grxml);
  }
  void Update () {
    IntPtr tagPtr = (IntPtr)0;
    IntPtr itemPtr = (IntPtr)0;
    int flag = getSpeech(kinect,ref tagPtr,ref itemPtr);
    string speechTag = Marshal.PtrToStringUni(tagPtr);
    string speechItem = Marshal.PtrToStringUni(itemPtr);
    if (flag > 0) {
      Debug.Log("tag = "+speechTag);
      Debug.Log("item = "+speechItem);
    }
    if (flag>0 && speechTag.CompareTo("RED")==0) {
      gameObject.GetComponent<Renderer>().material.color = new Color(1.0f, 0.0f, 0.0f, 1.0f);
    } else if (flag>0 && speechTag.CompareTo("GREEN")==0) {
      gameObject.GetComponent<Renderer>().material.color = new Color(0.0f, 1.0f, 0.0f, 1.0f);
    } else if (flag>0 && speechTag.CompareTo("BLUE")==0) {
      gameObject.GetComponent<Renderer>().material.color = new Color(0.0f, 0.0f, 1.0f, 1.0f);
    } else if (flag>0 && speechTag.CompareTo("EXIT")==0) {
      Application.Quit();
      UnityEditor.EditorApplication.isPlaying = false;
    }
  }
  void OnApplicationQuit() {
    destroySpeech(kinect);
  }
}
```
You only need one word-definition file for speech recognition, but put both the Japanese file "Grammar_jaJP.grxml" and the English file "Grammar_enUS.grxml" under the Unity project so that you can experiment with switching languages later.
Grammar_jaJP.grxml
```xml
<?xml version="1.0" encoding="utf-8" ?>
<grammar version="1.0" xml:lang="ja-JP" root="rootRule"
         tag-format="semantics/1.0-literals"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="rootRule">
    <one-of>
      <item>
        <tag>RED</tag>
        <one-of>
          <item> 赤 </item>
          <item> 赤色 </item>
        </one-of>
      </item>
      <item>
        <tag>GREEN</tag>
        <one-of>
          <item> 緑 </item>
          <item> 緑色 </item>
        </one-of>
      </item>
      <item>
        <tag>BLUE</tag>
        <one-of>
          <item> 青 </item>
          <item> 青色 </item>
        </one-of>
      </item>
      <item>
        <tag>EXIT</tag>
        <one-of>
          <item> 終わり </item>
          <item> 終了 </item>
        </one-of>
      </item>
    </one-of>
  </rule>
</grammar>
```
Grammar_enUS.grxml
```xml
<?xml version="1.0" encoding="utf-8" ?>
<grammar version="1.0" xml:lang="en-US" root="rootRule"
         tag-format="semantics/1.0-literals"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="rootRule">
    <one-of>
      <item>
        <tag>RED</tag>
        <one-of>
          <item> Red </item>
        </one-of>
      </item>
      <item>
        <tag>GREEN</tag>
        <one-of>
          <item> Green </item>
        </one-of>
      </item>
      <item>
        <tag>BLUE</tag>
        <one-of>
          <item> Blue </item>
        </one-of>
      </item>
      <item>
        <tag>EXIT</tag>
        <one-of>
          <item> Exit </item>
          <item> Quit </item>
          <item> Stop </item>
        </one-of>
      </item>
    </one-of>
  </rule>
</grammar>
```
[Notice] We generate an OpenCV window in the DLL to display the skeleton recognition state. Note that when the OpenCV window has focus, that is, when the Unity game window does not have focus, the Unity screen will not change. Click the top of the Unity window to give it focus, and then try the program's behaviour.
Edit CubeBehaviour.cs and change it as follows.
| from | to |
|---|---|
| "ja-JP" | "en-US" |
| "Grammar_jaJP.grxml" | "Grammar_enUS.grxml" |